# Input

Hi ChatGPT, I’m working on my Week 8 Instructor-led Lab for my graduate Python course (BGEN 632 - UM MSBA). 
I am not looking for final answers. I want help understanding how to approach the assignment step by step. Provide a step-by-step walkthrough using a different hypothetical question, without giving me answers, so I can figure it out myself.
Can you walk me through how I should structure the logic using pseudo-code or examples, without giving me the full solution? I’ll write my own code based on what I learn.

# Instructor-led Lab: Advanced Data Manipulation

In this assignment you will practice your new skills in data manipulation with the *piping* expression for `pandas`. Please follow the instructions below.

## Sorting and Ordering Data

Last week you sorted, ordered, and filtered the data [github_teams.csv](/data/github_teams.csv) using basic pandas techniques. Now that you have learned advanced techniques that rely on piping expressions, you will perform similar operations again with your new skills.

Please perform the following operations in order:

* Select the columns `Team_type`, `human_work`, and `work_per_human`.
* Select columns that end in the letter `t`. Use the regex `t$`.
* Sort your data descending using the columns `Team_size_class`, `human_work`, `work_per_human`.
* Select `human-bot` teams that have a `bot_members_count` value greater than or equal to 3.
* Find the `human` teams that are `Large` and have a `human_gini` value greater than or equal to 0.75.
* How many teams are in the `Small` or `Large` category?
* How many teams are in the `Small` or `Large` category with a `human_gini` value less than or equal to 0.25?
* How many `human` teams are in the `Medium` category?
* Save the columns `Team_size_class` and `work_per_human` as a new DataFrame.
* Rename these two columns in the new DataFrame. Change `human_gini` to `work_inequality` and `eval_survival_day_median` to `issue_resolution_time`.

## Dataset Contents

| Variable  | Description | 
| :------------ |:---------------|
| name_h | Alphanumeric hash generated to anonymize repo name |
| Team_type |  A qualitative descriptor indicating whether the team was made of only humans or a blend of humans and bots (two levels: human, human-bot) |
| Team_size_class |  A qualitative descriptor indicating the size of the team, derived from human_members_count (three levels: small [2, 3], medium [4, 6], large [7, 246]) |
| human_members_count  | The number of human users in the repo |
| bot_members_count | The number of bots in the repo |
| human_work | The number of work events generated by humans in the repo |
| work_per_human | The ratio of work events to humans, derived from human_members_count and human_work |
| human_gini | Gini coefficient for human work in the repo |
| human_Push | The number of push events generated by humans in the repo |
| human_IssueComments | The number of issue comment events generated by humans in the repo |
| human_PRReviewComment | The number of pull request review comment events generated by humans in the repo |
| human_MergedPR | The number of merged pull request events generated by humans in the repo |
| bot_work | The number of work events generated by bots in the repo |
| bot_Push | The number of push events generated by bots in the repo |
| bot_IssueComments | The number of issue comment events generated by bots in the repo |
| bot_PRReviewComment | The number of pull request review comment events generated by bots in the repo |
| bot_MergedPR | The number of merged pull request events generated by bots in the repo |
| eval_survival_day_median | The median number of days that an issue remained open in the repo (teams who were not included in issue survival analysis have NA value) |
| issues_count | The number of issues in the repo |


# Output

Absolutely, Michelle. I love how you’re trying to understand the logic rather than just plugging in answers. Let’s walk through how to approach this lab like a pro, using a parallel, hypothetical example that mirrors each task from your GitHub teams lab.

⸻

🔧 Setup (Hypothetical Data)

Let’s say you’re working with a dataset called project_teams.csv with columns like:
	•	Project_type: “internal”, “external”
	•	Hours_logged: total hours logged by humans
	•	hours_per_person: ratio of hours per person
	•	Team_size_category: “Small”, “Medium”, “Large”
	•	ai_members_count: number of AI agents
	•	human_diversity: Gini coefficient (like human_gini)
	•	resolution_days_median: median time to resolve issues
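
To try the steps below without touching the lab data, one option is to build a tiny stand-in for project_teams.csv in memory. This is only a sketch: the column names come from the list above, but every value is invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for project_teams.csv; all values are made up.
df = pd.DataFrame({
    "Project_type": ["internal", "external", "external", "internal", "external", "internal"],
    "Hours_logged": [120, 340, 80, 560, 410, 95],
    "hours_per_person": [40.0, 56.7, 26.7, 62.2, 51.3, 23.8],
    "Team_size_category": ["Small", "Medium", "Small", "Large", "Large", "Medium"],
    "ai_members_count": [0, 3, 1, 0, 5, 0],
    "human_diversity": [0.20, 0.55, 0.10, 0.80, 0.78, 0.30],
    "resolution_days_median": [2.0, 5.5, None, 9.0, 7.5, 3.0],
})

# Quick checks on size and column types before doing anything else.
print(df.shape)
df.info()
```

Keeping the toy data small makes it easy to verify each later step by eye.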

⸻

🧠 Step-by-Step Pseudo-Logic with Parallel Thinking

⸻

🔹 Task 1: Select Specific Columns

🧩 How do I pull out only the few columns I want to analyze?

Pseudocode:

(df
 .filter(["Project_type", "Hours_logged", "hours_per_person"])
)

Think: “Use .filter([...]) to extract columns by name, much like dplyr’s select() in R.”
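
One behaviour worth knowing when selecting by name: .filter() keeps only the labels it finds and quietly skips the rest. The sketch below uses invented toy data and a deliberately misspelled column name (hours_per_persn) to show this.

```python
import pandas as pd

# Hypothetical toy data; values are invented for illustration.
df = pd.DataFrame({
    "Project_type": ["internal", "external"],
    "Hours_logged": [120, 340],
    "hours_per_person": [40.0, 56.7],
    "ai_members_count": [0, 3],
})

selected = df.filter(["Project_type", "hours_per_person"])

# A misspelled name is dropped silently rather than raising an error,
# so it is worth checking which columns actually came back.
with_typo = df.filter(["Project_type", "hours_per_persn"])

print(list(selected.columns))
print(list(with_typo.columns))
```

Checking the resulting column list after a selection is a cheap way to catch spelling mistakes early.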

⸻

🔹 Task 2: Select Columns That End in ‘t’

🧩 What if I want columns with names ending in a certain character?

Pseudocode:

(df
 .filter(regex="t$")
)

Think: Use regular expressions with .filter(regex="pattern") to dynamically select columns.
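
To see what the "$" anchor contributes, it can help to compare an anchored pattern with an unanchored one. A small sketch on invented toy data:

```python
import pandas as pd

# Hypothetical toy data; values are invented for illustration.
df = pd.DataFrame({
    "Project_type": ["internal", "external"],
    "Hours_logged": [120, 340],
    "ai_members_count": [0, 3],
    "Team_size_category": ["Small", "Large"],
})

# "$" anchors the match to the end of the column name.
ends_in_t = df.filter(regex="t$")

# Without the anchor, a lowercase "t" anywhere in the name matches.
contains_t = df.filter(regex="t")

print(list(ends_in_t.columns))
print(list(contains_t.columns))
```

The match is case-sensitive, so a name ending in a capital "T" would not be picked up by "t$".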

⸻

🔹 Task 3: Sort Data Descending by Multiple Columns

🧩 How do I sort based on multiple fields, in reverse order?

Pseudocode:

(df
 .sort_values(by=["Team_size_category", "Hours_logged", "hours_per_person"], ascending=False)
)

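Think: .sort_values() lets you order the rows. The list of column names sets the sort priority: pandas sorts by the first column, then uses each later column to break ties. A single ascending=False applies to every column in the list.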
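
The sketch below, on invented toy data, illustrates two details of multi-column sorting: text columns sort alphabetically, and ascending can also take one value per column.

```python
import pandas as pd

# Hypothetical toy data; values are invented for illustration.
df = pd.DataFrame({
    "Team_size_category": ["Small", "Large", "Medium", "Large"],
    "Hours_logged": [10, 50, 30, 70],
    "hours_per_person": [5.0, 5.0, 6.0, 7.0],
})

# Descending on every key. Text sorts alphabetically, so descending
# order here is Small, Medium, Large.
all_desc = df.sort_values(
    by=["Team_size_category", "Hours_logged", "hours_per_person"],
    ascending=False,
)

# A list gives each key its own direction: category descending,
# hours ascending within each category.
mixed = df.sort_values(
    by=["Team_size_category", "Hours_logged"],
    ascending=[False, True],
)

print(all_desc)
print(mixed)
```

Comparing the two printed frames shows how the tie between the two Large rows is resolved differently.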

⸻

🔹 Task 4: Filter by Two Conditions

🧩 How do I find all external projects with ≥ 3 AI members?

Pseudocode:

(df
 .query("Project_type == 'external' and ai_members_count >= 3")
)

Think: .query() is like SQL WHERE. Combine multiple conditions with the keywords "and" and "or".
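
For comparison, the same row filter can be written as a boolean mask, which is probably closer to the basic technique from last week. This sketch uses invented toy data and checks that both approaches return the same rows.

```python
import pandas as pd

# Hypothetical toy data; values are invented for illustration.
df = pd.DataFrame({
    "Project_type": ["internal", "external", "external", "external"],
    "ai_members_count": [4, 3, 1, 5],
})

via_query = df.query("Project_type == 'external' and ai_members_count >= 3")

# Equivalent boolean mask: note "&" instead of "and", and the
# parentheses required around each comparison.
via_mask = df.loc[(df["Project_type"] == "external") & (df["ai_members_count"] >= 3)]

print(via_query.equals(via_mask))
```

The .query() form fits more naturally into a method chain because it does not need to refer back to df by name.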

⸻

🔹 Task 5: Filter by Categorical and Numeric Condition

🧩 How do I get internal, Large teams with a high human_diversity (Gini) value?

Pseudocode:

(df
 .query("Project_type == 'internal' and Team_size_category == 'Large' and human_diversity >= 0.75")
)

Think: You can use strings for categories and numeric comparisons in one expression.

⸻

🔹 Task 6: Count Rows for Multiple Categories

🧩 How many projects are Small or Large?

Pseudocode:

(df
 .query("Team_size_category in ['Small', 'Large']")
 .shape[0]
)

Think: .shape[0] gives you the number of rows (i.e., teams) after filtering.
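
As a sanity check on a count like this, the total can be compared with a per-category breakdown. A minimal sketch on invented toy data:

```python
import pandas as pd

# Hypothetical toy data; values are invented for illustration.
df = pd.DataFrame({
    "Team_size_category": ["Small", "Medium", "Large", "Small", "Medium"],
    "human_diversity": [0.20, 0.50, 0.80, 0.30, 0.10],
})

small_or_large = df.query("Team_size_category in ['Small', 'Large']")

# shape is a (rows, columns) tuple, so index 0 is the row count.
n_rows = small_or_large.shape[0]

# value_counts() shows the per-category breakdown behind that total.
per_category = df["Team_size_category"].value_counts()

print(n_rows)
print(per_category)
```

If the Small and Large counts from value_counts() do not add up to n_rows, the filter condition is worth a second look.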

⸻

🔹 Task 7: Add Extra Condition on Numeric Column

🧩 Now only count if those same teams have low diversity.

Pseudocode:

(df
 .query("Team_size_category in ['Small', 'Large'] and human_diversity <= 0.25")
 .shape[0]
)

Think: Add another condition to the same .query() string, joined with "and".

⸻

🔹 Task 8: Count Based on Category + Another Condition

🧩 How many internal, medium teams exist?

Pseudocode:

(df
 .query("Project_type == 'internal' and Team_size_category == 'Medium'")
 .shape[0]
)

Think: Filter down by combining values across two columns.

⸻

🔹 Task 9: Create New DataFrame With Selected Columns

🧩 I want to reuse only size and hours-per-person for new analysis.

Pseudocode:

new_df = df.filter(["Team_size_category", "hours_per_person"])

Think: Save a slimmed-down version of the dataset to a new variable.

⸻

🔹 Task 10: Rename Columns

🧩 How do I rename columns for better reporting/plotting?

Pseudocode:

renamed_df = (new_df
              .rename(columns={
                  "hours_per_person": "efficiency_score"
              }))

Think: Use a dictionary inside .rename(columns={...}) to swap old → new names.

If you’re renaming from the full dataset:

renamed_df = (df
              .filter(["Team_size_category", "hours_per_person", "resolution_days_median"])
              .rename(columns={
                  "hours_per_person": "efficiency_score",
                  "resolution_days_median": "issue_resolution_time"
              }))
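
A related point to check when renaming: by default, .rename() ignores dictionary keys that do not match any column, so a rename aimed at a column that was never selected does nothing. The sketch below uses invented toy data and the errors="raise" option to surface that case.

```python
import pandas as pd

# Hypothetical slimmed-down frame; values are invented for illustration.
new_df = pd.DataFrame({
    "Team_size_category": ["Small", "Large"],
    "hours_per_person": [40.0, 62.2],
})

# By default, a key that matches no column is ignored without warning.
unchanged = new_df.rename(columns={"resolution_days_median": "issue_resolution_time"})

# errors="raise" turns that silent no-op into a KeyError.
try:
    new_df.rename(
        columns={"resolution_days_median": "issue_resolution_time"},
        errors="raise",
    )
    raised = False
except KeyError:
    raised = True

print(list(unchanged.columns))
print(raised)
```

Before renaming, it is worth confirming that the columns you want to rename are actually present in the DataFrame you are working with.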



⸻

🧰 Final Advice

To implement this:
	1.	Use .filter() for columns and .query() for rows.
	2.	Wrap the whole chain in parentheses so each method call can sit on its own line.
	3.	Test one line at a time before chaining more logic.
	4.	Comment your code so you remember what each block does.
	5.	Validate your assumptions by using .head(), .info(), and .shape.
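
The advice above can be sketched as follows, again on invented toy data: run the first link of the chain alone, inspect it, then extend the chain one method at a time inside parentheses.

```python
import pandas as pd

# Hypothetical toy data; values are invented for illustration.
df = pd.DataFrame({
    "Project_type": ["internal", "external", "external", "internal"],
    "Team_size_category": ["Small", "Large", "Large", "Medium"],
    "Hours_logged": [120, 410, 340, 95],
    "hours_per_person": [40.0, 51.3, 56.7, 23.8],
})

# Step 1: run the first link alone and inspect the result.
step1 = df.query("Project_type == 'external'")
print(step1.shape)

# Step 2: once step 1 looks right, extend the chain one method at a time.
result = (
    df
    .query("Project_type == 'external'")                   # keep matching rows
    .filter(["Team_size_category", "hours_per_person"])    # keep two columns
    .sort_values(by="hours_per_person", ascending=False)   # largest first
    .rename(columns={"hours_per_person": "efficiency_score"})
)
print(result.head())
```

Because df itself is never modified, any link in the chain can be commented out and rerun without side effects.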

⸻

Would you like me to review your pseudocode before you test it on github_teams.csv?