# Analyzing Data Sets

In [1]:
!pwd

/Users/katiezhang/Desktop/Matriculate Data


In [2]:
mat_csv = 'MAT Performance Task Data Sheet.csv'

In [3]:
import pandas as pd

mat_df = pd.read_csv(mat_csv)

mat_df.head()

,Status,First Generation,Ethnicity,GPA,SAT,PSAT,ACT,How Did You Hear About Matriculate?,Extracurricular Activities,Program Type,FAFSA Status
0,Active,No,Asian,4.28,,1500.0,,A peer,Louisville Youth Orchestra (LYO)\nLYO’s Presto...,CollegePoint,80-Completed
1,Active,Yes,Asian,4.0,1500.0,1390.0,,A peer,Stuyvesant Red Cross\nStuyvesant Road Runners\...,CollegePoint,80-Completed
2,Active,Yes,Asian,3.5,1480.0,1460.0,,A Matriculate student or advisor,MSA\nTech Crew for Stuyvesant Theatre Communit...,CollegePoint,80-Completed
3,Active,Yes,Asian,4.0,1500.0,1450.0,,"A peer,A Matriculate student or advisor","Stuyvesant's Honor Society (ARISTA), Red Cross...",CollegePoint,80-Completed
4,Active,Yes,Asian,4.0,1410.0,1380.0,,"A peer,A Matriculate student or advisor",Traditional Shotokan Karate (student and assis...,CollegePoint,80-Completed


In [4]:
mat_df.columns

Index(['Status', 'First Generation', 'Ethnicity', 'GPA', 'SAT', 'PSAT', 'ACT',
       'How Did You Hear About Matriculate?', 'Extracurricular Activities',
       'Program Type', 'FAFSA Status'],
      dtype='object')

In [5]:
mat_df.rename(columns = {'Program Type': 'Program_Type', 'FAFSA Status': 'FAFSA_Status'}, inplace=True)

## 1. % of HSFs that are Active

Step 1: Find the number of active HSFs

In [6]:
mat_df.Status.value_counts()

Active                              975
Never Engaged - no response         531
Inactive                            286
Never Engaged - not interested       86
Never Engaged - no first meeting     85
Never Engaged - interested later     38
Name: Status, dtype: int64

The value_counts() output shows that there are 975 active HSFs.

In [7]:
active = mat_df[mat_df['Status'] == 'Active'].shape[0]

Store the number of active HSFs in the variable 'active' for later use.

Step 2: Find total number of HSFs

In [8]:
total_HSFs = mat_df.shape[0]
total_HSFs

2001

There are 2001 HSFs in total (one per row of the data).

Step 3: Calculate the % of active HSFs by dividing the number of active HSFs by the total number of HSFs

In [9]:
active/total_HSFs

0.487256371814093

48.7% of HSFs in this cohort are active.
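
As a cross-check, the same share can be read off in one step with value_counts(normalize=True). The block below is a minimal sketch on a small made-up frame (toy_df is hypothetical and stands in for mat_df), not the real data.

```python
import pandas as pd

# Hypothetical toy data standing in for mat_df; not the real dataset.
toy_df = pd.DataFrame({
    'Status': ['Active', 'Active', 'Inactive', 'Never Engaged - no response'],
})

# normalize=True returns each status as a fraction of the counted rows.
status_share = toy_df['Status'].value_counts(normalize=True)

active_share = status_share['Active']
print(f"{active_share:.1%}")
```

By default value_counts() leaves out rows with a missing Status, so the denominator only matches the total row count when the column has no gaps. Passing dropna=False keeps missing values in the count.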

## 2. % of HSFs that are non-white

Step 1: Find number of HSFs who are non-white

In [10]:
non_white = mat_df[mat_df['Ethnicity'] != 'White'].shape[0]
non_white

1634

Step 2: Divide the number of non-white HSFs by the total number of HSFs to find the %

In [11]:
non_white/total_HSFs

0.816591704147926

81.7% of all HSFs are non-white.
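
One thing worth checking with a != filter: a missing value also compares as "not equal" to 'White', so any row with a blank Ethnicity would be counted as non-white. Whether the real column has blanks is not shown here, so this is only a sketch on made-up data (toy_df is hypothetical) of how the two counts can differ.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the real Ethnicity column may or may not have gaps.
toy_df = pd.DataFrame({'Ethnicity': ['Asian', 'White', np.nan, 'Black']})

# NaN != 'White' evaluates to True, so the missing row is counted here.
non_white_incl_missing = (toy_df['Ethnicity'] != 'White').sum()

# Requiring a reported ethnicity excludes the missing row.
reported = toy_df['Ethnicity'].notna()
non_white_reported = (reported & (toy_df['Ethnicity'] != 'White')).sum()

print(non_white_incl_missing, non_white_reported)
```

If the two numbers match on the real data, the 81.7% figure is unaffected.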

## 3. % of Active HSFs that completed their FAFSA

Step 1: Find the number of HSFs who are active and have completed their FAFSA

In [12]:
active

975

In [13]:
active_FAFSA_complete = mat_df[(mat_df['Status'] == 'Active') & (mat_df['FAFSA_Status'] == '80-Completed')].shape[0]
active_FAFSA_complete

901

Step 2: Divide the number of HSFs who are active and have completed their FAFSA by the number of active HSFs

In [14]:
active_FAFSA_complete/active

0.9241025641025641

92.4% of active HSFs have completed their FAFSA
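
The two steps above can also be collapsed: filter to the active rows first, then take the mean of a True/False column, which gives the fraction of True values. A minimal sketch on made-up data (toy_df and the 'Not completed (placeholder)' label are hypothetical, only '80-Completed' comes from the data above):

```python
import pandas as pd

# Hypothetical toy data; 'Not completed (placeholder)' is an invented label.
toy_df = pd.DataFrame({
    'Status': ['Active', 'Active', 'Active', 'Inactive'],
    'FAFSA_Status': ['80-Completed', '80-Completed',
                     'Not completed (placeholder)', '80-Completed'],
})

# Keep only the active rows, so the denominator is the active group.
active_rows = toy_df[toy_df['Status'] == 'Active']

# The mean of a boolean Series is the share of True values.
fafsa_rate = (active_rows['FAFSA_Status'] == '80-Completed').mean()
print(fafsa_rate)
```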

## 4. % of College Board HSFs that are First Generation students

Step 1: Find the number of College Board HSFs

In [15]:
CB_HSFs = mat_df[mat_df['Program_Type'] == 'College Board'].shape[0]
CB_HSFs

233

Step 2: Find the number of College Board HSFs who are also First Generation students

In [16]:
CB_First_Gen = mat_df[(mat_df['Program_Type'] == 'College Board') & (mat_df['First Generation'] == 'Yes')].shape[0]
CB_First_Gen

9

Step 3: Divide the number of First Generation College Board HSFs by the number of College Board HSFs

In [18]:
CB_First_Gen / CB_HSFs

0.03862660944206009

3.9% of College Board HSFs are First Generation students.
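
Every question in this notebook follows the same pattern: count the rows in a group, count the rows in that group that also meet a condition, then divide. A small helper could capture this. The function name share_within and the toy data are assumptions for illustration, not part of the original analysis.

```python
import pandas as pd


def share_within(group_mask, condition_mask):
    """Fraction of rows in group_mask that also satisfy condition_mask."""
    group_size = group_mask.sum()
    if group_size == 0:
        # Avoid dividing by zero when the group is empty.
        return float('nan')
    return (group_mask & condition_mask).sum() / group_size


# Hypothetical toy data standing in for mat_df.
toy_df = pd.DataFrame({
    'Program_Type': ['College Board', 'College Board', 'CollegePoint', 'College Board'],
    'First Generation': ['Yes', 'No', 'Yes', 'No'],
})

cb_first_gen_share = share_within(
    toy_df['Program_Type'] == 'College Board',
    toy_df['First Generation'] == 'Yes',
)
print(f"{cb_first_gen_share:.1%}")
```

Formatting with .1% also keeps the rounding consistent from one question to the next.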

## 5. % of Inactive HSFs who also have an SAT below 1500

Step 1: Find the number of Inactive HSFs

In [22]:
inactive = mat_df[mat_df['Status'] == 'Inactive'].shape[0]
inactive # which is the same as the number of inactive HSFs we found with the value_counts() function

286

Step 2: Find the number of inactive HSFs who also have an SAT score below 1500

In [24]:
inactive_SAT_1500 = mat_df[(mat_df['Status'] == 'Inactive') & (mat_df['SAT'] < 1500)].shape[0]
inactive_SAT_1500

64

Step 3: Divide the number of inactive HSFs who have an SAT score below 1500 by the number of inactive HSFs

In [25]:
inactive_SAT_1500/inactive

0.22377622377622378

22.4% of inactive HSFs have an SAT score below 1500. This number alone does not establish a correlation between lower SAT scores and being inactive: HSFs can be inactive for other reasons, such as not being interested in the program because they already have a good counselor.
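
The head() output earlier shows that SAT can be missing, and a comparison like SAT < 1500 is False for a missing score. Inactive HSFs with no SAT on file therefore stay in the denominator but can never reach the numerator. The sketch below, on made-up data (toy_df is hypothetical), compares the share of all inactive HSFs with the share of inactive HSFs who reported an SAT.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; two inactive rows have no SAT score.
toy_df = pd.DataFrame({
    'Status': ['Inactive', 'Inactive', 'Inactive', 'Inactive', 'Active'],
    'SAT': [1400.0, 1520.0, np.nan, np.nan, 1450.0],
})

inactive_rows = toy_df[toy_df['Status'] == 'Inactive']

# NaN < 1500 is False, so missing scores are not counted as "below 1500".
below_1500 = (inactive_rows['SAT'] < 1500).sum()

# Denominator 1: every inactive HSF, with or without a score.
share_of_all_inactive = below_1500 / len(inactive_rows)

# Denominator 2: only inactive HSFs who reported an SAT score.
share_of_reported = below_1500 / inactive_rows['SAT'].notna().sum()

print(share_of_all_inactive, share_of_reported)
```

Which denominator is appropriate depends on the question being asked, so it helps to state the choice next to the result.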

## 6. % of Asian HSFs who are First Generation students

Step 1: Find number of Asian HSFs

In [28]:
asian = mat_df[mat_df['Ethnicity'] == 'Asian'].shape[0]
asian

448

Step 2: Find the number of first generation Asian students

In [32]:
asian_first_gen = mat_df[(mat_df['Ethnicity'] == 'Asian') & (mat_df['First Generation'] == 'Yes')].shape[0]
asian_first_gen

267

Step 3: Divide the number of First Generation Asian students by the number of Asian students

In [35]:
asian_first_gen/asian

0.5959821428571429

59.6% of Asian HSFs are First Generation!
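
To compare this rate across all ethnicities at once instead of filtering one group at a time, a groupby over a True/False column is one option. A minimal sketch on made-up data (toy_df is hypothetical, not the real dataset):

```python
import pandas as pd

# Hypothetical toy data standing in for mat_df.
toy_df = pd.DataFrame({
    'Ethnicity': ['Asian', 'Asian', 'White', 'White', 'Asian'],
    'First Generation': ['Yes', 'No', 'No', 'Yes', 'Yes'],
})

# Flag first-generation rows, then average the flag within each ethnicity.
first_gen_rate = (
    toy_df['First Generation'].eq('Yes')
    .groupby(toy_df['Ethnicity'])
    .mean()
)
print(first_gen_rate)
```

Rows with a missing 'First Generation' value count as not first generation in this sketch, and rows with a missing Ethnicity are left out of the groups by default.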