# ChatGPT Conversation

# Input:

Hi ChatGPT, I’m working on my Week 7 Instructor-led Lab for my graduate Python course (BGEN 632 - UM MSBA). 
I am not looking for final answers. I want help understanding how to approach the assignment step-by-step.
Can you walk me through how I should structure the logic using pseudo-code or examples—without giving me the full solution? I’ll write my own code based on what I learn. 

In this assignment you will practice utilizing arrays and/or lists in a program.

# Instructor-led Lab: Manipulating Data

In this assignment you will practice utilizing DataFrames in a program. 

For this lab, you will use the [github_teams.csv](/data/github_team.csv) file, which contains behavioral trace data extracted from GitHub.

## Accessing Data

Using the `github_teams` dataset, please perform the following operations in order:

* Open the file within Python.
* Find out what the column header names are.
* Determine the number of columns.
* Determine the number of rows.
* Determine which columns are categorical and convert them from *object* to *category*.
* How many unique values does `Team_type` have?
* How many unique values does `Team_size_class` have?
* What is the value of the 63rd row and 6th column?
* What are the values for the 300th row?
* Using three different methods, select the row with index value 595 and the 1st, 2nd, and 3rd columns.
* Using two different methods, select the row with index value 46 and the 3rd and 7th columns.
* Create a new DataFrame for the column `bot_work` using two different methods.

## Sorting and Ordering data 

Now that you have learned to subsample data, it is your turn to apply your new knowledge. Using the `github_teams` dataset, please perform the following operations in order:

* Select `human-bot` teams that have a `bot_members_count` value greater than or equal to 2.
* Find the `human` teams that are `Large` and have a `human_gini` value greater than or equal to 0.75.
* How many teams are in the `Small` or `Large` category?
* How many teams are in the `Small` or `Large` category with a `human_gini` value less than or equal to 0.20?
* How many `human-bot` teams are in the `Medium` category?
* Create a subsample of 50% of your data.
* Create samples for an 8-fold cross validation test.
* Select columns that are numeric and save them as a new DataFrame.
* Remove the columns `bot_PRReviewComment` and `bot_MergedPR` from the DataFrame.
* Save the columns `Team_size_class` and `human_members_count` as a new DataFrame.
* Rename these two columns in the new DataFrame.

Save your notebook with output displayed within it and submit for grading.


# Output:

You’re not looking for answers—just a step-by-step breakdown of how to think through and structure the logic. Let’s walk through that using plain English + light pseudo-code to guide your own implementation.

⸻

🧠 Big Picture

You’re working with the github_teams.csv file. The lab is broken into two sections:

Part 1: Accessing & Inspecting Data

Goal: Load the data and understand its structure.

Part 2: Sorting & Subsetting

Goal: Filter, sample, and manipulate the DataFrame.

⸻

🔧 Pseudo-code / Approach: Section by Section

⸻

🟩 1. Accessing the Data

Step-by-step:

# Step 1: Import libraries
import pandas as pd
import os  # if needed to set your working directory

# Step 2: Load the dataset
df = pd.read_csv('path/to/github_teams.csv')
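
If you want to see what pd.read_csv hands back before pointing it at the real file, here is a minimal sketch that reads a tiny CSV from a string. The column names and values are made up for illustration and are not taken from github_teams.csv.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a real file; this is not the lab's data.
csv_text = """team_id,kind,size
1,alpha,3
2,beta,5
3,alpha,8
"""

# read_csv accepts a file path or any file-like object.
toy = pd.read_csv(io.StringIO(csv_text))

print(list(toy.columns))  # header names as a plain list
print(toy.shape)          # tuple of (number of rows, number of columns)
```

Swapping io.StringIO(csv_text) for a path string gives the same kind of DataFrame, so anything you try on the toy frame should carry over to your own df.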



⸻

✅ Next Tasks:

Think of each bullet point as a small test case:
	•	✅ Find the column names → df.columns
	•	✅ Find the number of rows/cols → df.shape
	•	✅ Find categorical columns:

df.dtypes   # See which ones are 'object'
df['column_name'] = df['column_name'].astype('category')


	•	✅ Unique values:

df['Team_type'].nunique()
df['Team_size_class'].nunique()


	•	✅ Access specific data:

df.iloc[62, 5]  # 63rd row, 6th column
df.iloc[299]    # 300th row


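	•	✅ Select rows & columns in multiple ways:
Think .loc (index label plus column names), .iloc (integer positions), and selecting columns first and then the row. An "index value" such as 595 is a label, so it maps naturally to .loc; with the default index that read_csv creates, label 595 is also position 595 for .iloc.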
	•	✅ Create new DataFrame from one column:
Use pd.DataFrame(df['bot_work']) or df[['bot_work']]
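
To make the label-versus-position distinction concrete, here is a small sketch on a toy DataFrame. Everything in it (the column names a, b, c and the index labels 100, 200, 300) is hypothetical; the index labels are chosen so that they clearly differ from row positions.

```python
import pandas as pd

# Hypothetical toy data; index labels deliberately differ from positions.
toy = pd.DataFrame(
    {"a": [10, 20, 30], "b": ["x", "y", "x"], "c": [1.5, 2.5, 3.5]},
    index=[100, 200, 300],
)

# Convert a text column to the category dtype and count its distinct values.
toy["b"] = toy["b"].astype("category")
n_unique_b = toy["b"].nunique()

# Label-based: the row whose index value is 200, columns chosen by name.
by_label = toy.loc[200, ["a", "b"]]

# Position-based: second row, first two columns.
by_position = toy.iloc[1, 0:2]

# Mixed: index label for the row, column positions translated into names.
mixed = toy.loc[200, toy.columns[[0, 1]]]

print(n_unique_b)
print(by_label.tolist(), by_position.tolist(), mixed.tolist())
```

All three selections return the same row slice here. The useful habit is asking, for each task, whether the number you were given is a label or a position.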

⸻

🟨 2. Sorting and Subsetting Data

This section tests your ability to filter and slice based on conditions.

Examples to mentally walk through:
	•	✅ Filter with multiple conditions:

df[(df['Team_type'] == 'human-bot') & (df['bot_members_count'] >= 2)]


	•	✅ Filter with isin():

df[df['Team_size_class'].isin(['Small', 'Large'])]


	•	✅ Create a subsample (random sample):

df.sample(frac=0.5, random_state=1)


	•	✅ Use KFold cross-validation (structure only):

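from sklearn.model_selection import KFold
kf = KFold(n_splits=8, shuffle=True, random_state=1)  # shuffle so folds are not just consecutive row blocks
for fold, (train_idx, test_idx) in enumerate(kf.split(df)):
    # train_idx and test_idx are integer positions, so pair them with df.iloc[...]
    print(fold, len(train_idx), len(test_idx))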


	•	✅ Get only numeric columns:

df.select_dtypes(include='number')


	•	✅ Drop columns:

df.drop(['col1', 'col2'], axis=1)  # returns a new DataFrame; assign the result if you want to keep it


	•	✅ Create new DataFrame with selected columns & rename:

new_df = df[['Team_size_class', 'human_members_count']].copy()
new_df.columns = ['TeamSizeCategory', 'HumanMembers']
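
Several of the tasks ask "how many", which is a filter followed by a count. Here is a rough sketch of that pattern on invented data; the columns kind, size, and score are placeholders, not columns from your dataset.

```python
import pandas as pd

# Hypothetical toy data; column names and values are invented.
toy = pd.DataFrame({
    "kind": ["alpha", "beta", "alpha", "beta", "alpha"],
    "size": ["S", "M", "L", "S", "L"],
    "score": [0.10, 0.80, 0.90, 0.15, 0.40],
})

# Each comparison yields a boolean Series. When comparisons are written
# inline, each one needs its own parentheses before combining with & or |.
is_alpha = toy["kind"] == "alpha"
high_score = toy["score"] >= 0.75
small_or_large = toy["size"].isin(["S", "L"])

alpha_high = toy[is_alpha & high_score]        # rows matching both conditions
n_small_or_large = int(small_or_large.sum())   # each True counts as 1
n_combined = len(toy[small_or_large & (toy["score"] <= 0.20)])

print(len(alpha_high), n_small_or_large, n_combined)  # prints: 1 4 2
```

Naming the masks first, as done here, tends to make multi-condition filters easier to read and to debug one piece at a time.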



⸻

🧩 How to Structure Your Notebook

In your Jupyter notebook, follow this flow:
	1.	🔷 Markdown: Add section headers (e.g., ## Step 1: Importing and Loading Data)
	2.	🧪 Code cells: Use simple, testable snippets (1-2 lines at a time)
	3.	📓 Markdown below each code cell: Explain what you’re doing and why
	4.	✅ Output displayed: Confirm each block does what you expect
	5.	📚 References section at the bottom

⸻

✅ Summary Checklist

| Task | Approach |
| --- | --- |
| Load data | pd.read_csv() |
| Inspect structure | df.shape, df.columns, df.dtypes |
| Convert types | astype('category') |
| Access rows/cols | iloc, loc, df[] |
| Filter rows | Use Boolean conditions |
| Random samples | df.sample() |
| Cross-validation | Use KFold |
| Subset/rename cols | Select columns, then rename() or change .columns |
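
For the sampling, cross-validation, and rename rows of the checklist, here is one possible sketch on toy data. Note that the fold split is a manual NumPy stand-in (shuffle the row positions, then cut them into 8 chunks) rather than scikit-learn's KFold, and the column names value and measurement are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with 16 rows.
toy = pd.DataFrame({"value": range(16)})

# Random half of the rows; random_state makes the draw repeatable.
half = toy.sample(frac=0.5, random_state=1)

# Manual stand-in for KFold: shuffle row positions, then cut into 8 chunks.
rng = np.random.default_rng(1)
positions = rng.permutation(len(toy))
folds = np.array_split(positions, 8)

for i, test_pos in enumerate(folds):
    test_part = toy.iloc[test_pos]               # rows held out for this fold
    train_part = toy.drop(toy.index[test_pos])   # every other row
    print(i, len(train_part), len(test_part))

# rename() maps old names to new ones and returns a new DataFrame.
renamed = toy.rename(columns={"value": "measurement"})
print(list(renamed.columns))
```

Each row lands in exactly one test chunk, which is the property an 8-fold split needs. If your course expects KFold specifically, treat this only as a way to see what the splitter is doing under the hood.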



⸻

Let me know when you’re ready to test parts of your code—I’ll help you troubleshoot or expand them!