# Data Dive and Preparation in Python

In this notebook, we will learn how to inspect the contents of our variables, "clean" or "process" them, and prepare them for analysis.


### Reading in our data

First, we will load `pandas` and read in our dataset. 


In [124]:
import pandas as pd
collegeDat = pd.read_csv('data/colleges.csv')

### Making a copy of our original data for processing

Let's first make a copy to which we will make modifications. This preserves our "raw," original data. We can then look back at our raw data in order to check our work. Note that we need to use the `copy()` function to do this properly in Python. Otherwise, if we just use `=`, we are simply making a new reference to the same data frame.
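The difference between `=` and `copy()` can be checked on a small example. The sketch below uses a made-up two-row data frame (`raw`, `alias`, and `safe` are illustrative names, not from the notebook). In this notebook the equivalent step would presumably be `collegeDat_mod = collegeDat.copy()`, since the later cells modify `collegeDat_mod`.

```python
import pandas as pd

# small made-up frame standing in for the raw data (hypothetical values)
raw = pd.DataFrame({'name': ['A', 'B'], 'SAT_avg': [1000.0, 1200.0]})

alias = raw          # plain assignment: a second name for the same object
safe = raw.copy()    # copy(): an independent data frame (deep copy by default)

# modify the copy; the raw data is untouched
safe.loc[0, 'SAT_avg'] = 0.0

print(alias is raw)             # same object
print(safe is raw)              # different object
print(raw.loc[0, 'SAT_avg'])    # still the original value
```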

### Creating an index

Our dataset already has an ID variable, but adding an index can help us track the order of the observations. 



In [125]:
# first we will find the sample size (the number of rows)
sampsize = collegeDat.shape[0]

# then we will create an index that ranges from 0 to sampsize - 1
# GOOD CODING PRACTICE: DON'T HARD CODE THE SAMPLE SIZE
collegeDat_mod['index'] = pd.Series(range(0,sampsize))

# Check 
collegeDat_mod[['OPEID','name','index']].head(10)


|  | OPEID | name | index |
|---|---|---|---|
| 0 | 100200 | Alabama A & M University | 0 |
| 1 | 105200 | University of Alabama at Birmingham | 1 |
| 2 | 2503400 | Amridge University | 2 |
| 3 | 105500 | University of Alabama in Huntsville | 3 |
| 4 | 100500 | Alabama State University | 4 |
| 5 | 105100 | The University of Alabama | 5 |
| 6 | 100700 | Central Alabama Community College | 6 |
| 7 | 831000 | Auburn University at Montgomery | 7 |
| 8 | 100900 | Auburn University | 8 |
| 9 | 101200 | Birmingham-Southern College | 9 |
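
One caveat worth knowing: when a `pd.Series` is assigned to a column, pandas aligns it on row labels rather than on position. The approach above works because the data frame appears to have the default 0, 1, 2, ... row labels. The sketch below, using a made-up frame with non-default labels, shows what could happen otherwise and one position-based alternative.

```python
import pandas as pd

# made-up frame whose row labels are NOT 0, 1, 2 (e.g., after filtering rows)
df = pd.DataFrame({'name': ['A', 'B', 'C']}, index=[10, 20, 30])
n_rows = df.shape[0]

# assigning a Series aligns on labels: labels 0-2 do not match 10/20/30, so NaN
df['index_from_series'] = pd.Series(range(0, n_rows))

# assigning a plain range is positional, so it always counts 0..n_rows-1
df['index_from_range'] = range(0, n_rows)

print(df)
```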


### Finding and fixing misleading values


In [126]:
# loading numpy to access features
import numpy as np

# describe the numeric data
collegeDat.describe(include=[np.number])


|  | OPEID | median_debt | default_rate | admit_rate | SAT_avg | enrollment | net_price | avg_cost | net_tuition | ed_spending_per_student | avg_faculty_salary | pct_PELL | pct_fed_loan | grad_rate | pct_firstgen | med_fam_income | med_alum_earnings |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4435.0 | 4435.0 | 4435.0 | 1704.0 | 1105.0 | 4435.0 | 4435.0 | 4435.0 | 4435.0 | 4435.0 | 3077.0 | 4435.0 | 4435.0 | 4435.0 | 4088.0 | 4399.0 | 3912.0 |
| mean | 1492464.0 | 11.19579 | 9.06009 | 70.812576 | 1139.842534 | 3110.519053 | 17.371474 | 27.10288 | 10.836639 | 7.760832 | 7.266518 | 45.55554 | 49.069461 | 54.945651 | 43.357756 | 31.79193 | 40.007157 |
| std | 1976276.0 | 5.319178 | 6.144554 | 20.567925 | 131.630792 | 6429.445325 | 8.638514 | 14.988075 | 7.50641 | 6.881391 | 2.528365 | 20.309775 | 24.542281 | 22.051351 | 12.931312 | 20.811117 | 14.486256 |
| min | 100200.0 | 1.932 | 0.0 | 2.44 | 760.0 | 0.0 | -0.407 | 4.76 | 0.0 | 0.0 | 0.897 | 0.0 | 0.0 | 0.0 | 8.866995 | 0.0 | 10.939 |
| 25% | 282200.0 | 6.863 | 4.4 | 59.7875 | 1050.0 | 171.0 | 10.849 | 16.4525 | 5.4395 | 4.126 | 5.61 | 29.83 | 30.925 | 37.31 | 35.006281 | 17.82775 | 29.72025 |
| 50% | 766900.0 | 9.5 | 8.2 | 74.68 | 1113.0 | 868.0 | 16.757 | 22.945 | 9.912 | 6.352 | 6.958 | 42.5 | 52.54 | 56.4 | 45.102178 | 24.67 | 38.056 |
| 75% | 2362002.0 | 15.0 | 12.3 | 86.115 | 1205.0 | 2953.0 | 22.4705 | 32.0325 | 14.218 | 9.342 | 8.573 | 60.38 | 67.68 | 71.915 | 52.599727 | 39.5165 | 47.38125 |
| max | 72098870.0 | 33.47 | 57.1 | 100.0 | 1566.0 | 109233.0 | 112.05 | 120.377 | 66.442 | 139.766 | 21.143 | 100.0 | 100.0 | 100.0 | 85.90604 | 179.864 | 132.969 |
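
The `count` row above is smaller for some columns (for example `SAT_avg`), which points to missing values, and the `min` row shows a negative `net_price`. A minimal sketch of how such checks could be made explicit, using made-up values rather than the real data:

```python
import numpy as np
import pandas as pd

# made-up values for illustration only
df = pd.DataFrame({
    'SAT_avg': [959.0, np.nan, 1566.0, np.nan],
    'net_price': [17.4, -0.4, 22.5, 10.8],
})

# number of missing values per column
n_missing = df.isna().sum()
print(n_missing)

# number of rows with a negative net price, which may deserve a closer look
n_negative = (df['net_price'] < 0).sum()
print(n_negative)
```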


In [127]:
# let's imagine the highest possible SAT score was 1500
# we will assume anything higher is a mistake and set it to missing
# This is what the code does:
#   It looks at each row in the collegeDat DataFrame.
#   Wherever the value of 'SAT_avg' in that DataFrame is greater than 1500,
#   it goes to the corresponding row in the collegeDat_mod DataFrame.
#   In the 'SAT_avg' column of collegeDat_mod, it replaces those values with np.nan (missing).
collegeDat_mod.loc[collegeDat['SAT_avg'] > 1500, 'SAT_avg'] = np.nan

# check 
maxSAT = collegeDat['SAT_avg'].max()
maxSAT_mod = collegeDat_mod['SAT_avg'].max()
print("The max of SAT_avg before:" + str(maxSAT))
print("The max of SAT_avg after:" + str(maxSAT_mod))

The max of SAT_avg before:1566.0
The max of SAT_avg after:1500.0
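
The same replacement can be written with `Series.mask`, which sets values to missing wherever a condition is true. This is an alternative sketch on made-up scores, not the notebook's method:

```python
import numpy as np
import pandas as pd

scores = pd.Series([959.0, 1566.0, 1500.0, np.nan])  # made-up values

# values above the assumed ceiling of 1500 become NaN; others are unchanged
cleaned = scores.mask(scores > 1500)

print(cleaned.max())
print(cleaned.isna().sum())
```

Because `mask` returns a new Series, the original `scores` is left as it was, which fits the goal of keeping the raw data intact.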


In [128]:
# describe the character data 
print(collegeDat.describe(include=[object]))

# this doesn't give enough information 
# what if some categories are slightly different but mean the same thing?

print(collegeDat_mod['highest_degree'].unique()) # looks good
print(collegeDat_mod['ownership'].unique()) # looks good
# etc.

# Here's another way to do this that also introduces a for loop
# make a list of the column names you want to check
columns_to_check = ['highest_degree', 'ownership', 'hbcu']

# loop through and print unique values for each column
for column in columns_to_check:
    unique_values = collegeDat_mod[column].unique()
    print(f"Unique values in {column}: {unique_values}")




                     name      city state   region highest_degree  \
count                4435      4435  4435     4435           4435   
unique               4357      1943    54        7              4   
top     Cortiva Institute  New York    CA  Midwest       Graduate   
freq                    6        51   423     1074           1464   

                 ownership  locale  hbcu online_only  
count                 4435    4435  4435        4435  
unique                   3       5     2           2  
top     Prviate for-profit  Suburb    No          No  
freq                  1684    1311  4348        4413  
['Graduate' 'Associates' 'Bachelors' 'Certificate']
['Public' 'Private nonprofit' 'Prviate for-profit']
Unique values in highest_degree: ['Graduate' 'Associates' 'Bachelors' 'Certificate']
Unique values in ownership: ['Public' 'Private nonprofit' 'Prviate for-profit']
Unique values in hbcu: ['Yes' 'No']
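
The `ownership` output above contains the label `Prviate for-profit`, which looks like a misspelling of "Private for-profit". Assuming that is the intended label, one way to correct it is with `Series.replace`. The sketch below uses a made-up column rather than the real data.

```python
import pandas as pd

# made-up rows for illustration
ownership = pd.Series(['Public', 'Private nonprofit', 'Prviate for-profit',
                       'Prviate for-profit'])

# map the misspelled label to the assumed correct spelling
ownership_fixed = ownership.replace({'Prviate for-profit': 'Private for-profit'})

# value_counts shows each category and how often it occurs
print(ownership_fixed.value_counts())
```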


### Handling categorical data

#### Converting categorical data to dummies

Note: The example in the book, converting school categories to numeric grades, is probably not a typical task. More often, we want to turn categories into a series of booleans that indicate whether an observation belongs to each category. This is sometimes referred to as "one-hot" encoding.



In [129]:
import pandas as pd

# the get_dummies function creates all the variables we need
# these are in a new data frame
highest_degree_dummies = pd.get_dummies(collegeDat_mod['highest_degree'])

# let's see what these look like
print(highest_degree_dummies.head(5))

# let's convert them to 1/0 for True/False
highest_degree_dummies = highest_degree_dummies.astype(int)

# let's take another look
print(highest_degree_dummies.head(5))

# add these to collegeDat_mod
collegeDat_mod = pd.concat([collegeDat_mod,highest_degree_dummies], axis=1)


   Associates  Bachelors  Certificate  Graduate
0       False      False        False      True
1       False      False        False      True
2       False      False        False      True
3       False      False        False      True
4       False      False        False      True
   Associates  Bachelors  Certificate  Graduate
0           0          0            0         1
1           0          0            0         1
2           0          0            0         1
3           0          0            0         1
4           0          0            0         1
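
`pd.get_dummies` can also produce 0/1 columns directly through its `dtype` argument, and `prefix` makes the new column names easier to trace back to their source. A small sketch on made-up values (the `degree` prefix is an arbitrary choice):

```python
import pandas as pd

# made-up values for illustration
degrees = pd.Series(['Graduate', 'Associates', 'Graduate', 'Certificate'])

# dtype=int gives 0/1 directly; prefix labels each new column
dummies = pd.get_dummies(degrees, prefix='degree', dtype=int)

print(dummies)
```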


#### Creating new categorical variables

There will be other times that we want to convert numeric data to categories. For example, we might create "bins" of average SAT scores.


In [135]:
# defining the edges of the bins
bins = [400, 1000, 1300, 1600]  

# defining the labels for your bins
labels = ['low', 'medium', 'high']

# creating new variable using cut function
collegeDat_mod['SAT_avg_category'] = pd.cut(collegeDat_mod['SAT_avg'], bins=bins, labels=labels, include_lowest=True)

# check 
print(collegeDat_mod[['SAT_avg','SAT_avg_category']].head(10))

# we could also create a category for missing
collegeDat_mod['SAT_avg_category'] = collegeDat_mod['SAT_avg_category'].astype('object')  # Ensure it's of object type in order to hold string
SAT_miss_which = collegeDat_mod['SAT_avg'].isna()
collegeDat_mod.loc[SAT_miss_which, 'SAT_avg_category'] = 'missing'

# check
print(collegeDat_mod[['SAT_avg','SAT_avg_category']].head(10))



   SAT_avg SAT_avg_category
0    959.0              low
1   1245.0           medium
2      NaN              NaN
3   1300.0           medium
4    938.0              low
5   1262.0           medium
6      NaN              NaN
7   1061.0           medium
8   1302.0             high
9   1202.0           medium
   SAT_avg SAT_avg_category
0    959.0              low
1   1245.0           medium
2      NaN          missing
3   1300.0           medium
4    938.0              low
5   1262.0           medium
6      NaN          missing
7   1061.0           medium
8   1302.0             high
9   1202.0           medium
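
An alternative to converting the column to `object` is to keep it categorical and add `missing` as an extra category, which preserves the low/medium/high ordering. A minimal sketch on made-up scores:

```python
import numpy as np
import pandas as pd

sat = pd.Series([959.0, 1245.0, np.nan, 1300.0, 1302.0])  # made-up values

# same bins and labels as above; intervals are closed on the right by default
sat_cat = pd.cut(sat, bins=[400, 1000, 1300, 1600],
                 labels=['low', 'medium', 'high'], include_lowest=True)

# register 'missing' as a category, then fill the NaN entries with it
sat_cat = sat_cat.cat.add_categories('missing').fillna('missing')

print(sat_cat.tolist())
```

Because the bins are closed on the right, a score of exactly 1300 lands in `medium`, which matches the output shown earlier.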
