# Introduction to pandas

### Loading and Exploring the Data

Let's begin by loading our data into pandas and assigning it to the variable `high_school_df`.

In [8]:
import pandas as pd
high_school_df = pd.read_csv('./nyc_hs_sat.csv', index_col=0)
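
To see what `index_col=0` changes without needing the CSV file, here is a minimal sketch that reads a tiny in-memory CSV both ways. The sample rows and the `load_sample` helper are made up for illustration and are not part of the dataset.

```python
import io

import pandas as pd

# Hypothetical two-row sample; the first, unnamed column mimics a saved index.
SAMPLE_CSV = """,dbn,name
0,01M292,SCHOOL A
1,01M448,SCHOOL B
"""


def load_sample(index_col=None):
    """Read the in-memory sample, optionally using one column as the index."""
    return pd.read_csv(io.StringIO(SAMPLE_CSV), index_col=index_col)


without_index = load_sample()          # the unnamed column is kept as data
with_index = load_sample(index_col=0)  # the unnamed column becomes the index

print(list(without_index.columns))  # ['Unnamed: 0', 'dbn', 'name']
print(list(with_index.columns))     # ['dbn', 'name']
```

Passing `index_col=0` is what keeps an `Unnamed: 0` column out of the loaded dataframe.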

Take a look at the columns that were loaded.

In [10]:
df_cols = high_school_df.columns

df_cols

Index(['dbn', 'name', 'num_test_takers', 'reading_avg', 'math_avg',
       'writing_score', 'boro', 'total_students', 'graduation_rate',
       'attendance_rate', 'college_career_rate'],
      dtype='object')

We can see the first few rows of data using slice notation, just as we would with a Python list.

In [13]:
high_school_df[:3]

| | dbn | name | num_test_takers | reading_avg | math_avg | writing_score | boro | total_students | graduation_rate | attendance_rate | college_career_rate |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 01M292 | HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES | 29.0 | 355.0 | 404.0 | 363.0 | M | 171 | 0.66 | 0.87 | 0.36 |
| 1 | 01M448 | UNIVERSITY NEIGHBORHOOD HIGH SCHOOL | 91.0 | 383.0 | 423.0 | 366.0 | M | 465 | 0.9 | 0.93 | 0.7 |
| 2 | 01M450 | EAST SIDE COMMUNITY SCHOOL | 70.0 | 377.0 | 402.0 | 370.0 | M | 683 | 0.92 | 0.94 | 0.77 |
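
Slicing with `[:3]` is one of several ways to get the first rows. The sketch below compares it with `head()` and `iloc` on a small, made-up dataframe (`demo_df` is a hypothetical stand-in for `high_school_df`).

```python
import pandas as pd

# Hypothetical stand-in for high_school_df with five rows.
demo_df = pd.DataFrame({
    'dbn': ['A1', 'B2', 'C3', 'D4', 'E5'],
    'math_avg': [404.0, 423.0, 402.0, 401.0, 433.0],
})

first_by_slice = demo_df[:3]      # row slice, as in the text above
first_by_head = demo_df.head(3)   # same rows via head()
first_by_iloc = demo_df.iloc[:3]  # same rows via explicit positional indexing

# All three approaches return the same three rows.
print(first_by_slice.equals(first_by_head))
print(first_by_slice.equals(first_by_iloc))
```

Of the three, `iloc` states most explicitly that rows are being selected by position.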


Let's select just the `name` column and assign it to the variable `high_school_name`.

In [14]:
high_school_name = high_school_df['name']

In [16]:
high_school_name[:2]

0    HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES
1              UNIVERSITY NEIGHBORHOOD HIGH SCHOOL
Name: name, dtype: object
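
Selecting with a single column name returns a Series, which is why the output above ends with `Name: name, dtype: object`. As a small sketch on made-up data (`demo_df` is hypothetical), compare single brackets with double brackets:

```python
import pandas as pd

# Hypothetical two-row dataframe for illustration.
demo_df = pd.DataFrame({
    'name': ['SCHOOL A', 'SCHOOL B'],
    'math_avg': [404.0, 423.0],
})

name_series = demo_df['name']   # a string key returns a Series
name_frame = demo_df[['name']]  # a list of keys returns a DataFrame

print(type(name_series).__name__)  # Series
print(type(name_frame).__name__)   # DataFrame
print(name_frame.shape)            # (2, 1)
```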

Ok, now let's select the columns that hold the SAT scores: `reading_avg`, `math_avg`, and `writing_score`. Assign the result to the variable `sat_df`.

In [17]:
sat_df = high_school_df[['reading_avg', 'math_avg', 'writing_score']]

In [20]:
sat_df[:3]

| | reading_avg | math_avg | writing_score |
|---|---|---|---|
| 0 | 355.0 | 404.0 | 363.0 |
| 1 | 383.0 | 423.0 | 366.0 |
| 2 | 377.0 | 402.0 | 370.0 |
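
When selecting several columns, it can help to keep the column names in a list of their own. The sketch below does that on made-up data and also shows that asking for a column that does not exist raises a `KeyError`. Both `demo_df` and `score_cols` are hypothetical names.

```python
import pandas as pd

# Hypothetical two-row dataframe for illustration.
demo_df = pd.DataFrame({
    'name': ['SCHOOL A', 'SCHOOL B'],
    'reading_avg': [355.0, 383.0],
    'math_avg': [404.0, 423.0],
})

score_cols = ['reading_avg', 'math_avg']
scores_df = demo_df[score_cols]  # columns come back in the order listed

print(list(scores_df.columns))

# Requesting a missing column fails loudly instead of returning empty data.
try:
    demo_df[['reading_avg', 'no_such_column']]
    missing_column_raised = False
except KeyError:
    missing_column_raised = True

print(missing_column_raised)
```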


### Working with the index column

The index is currently a default run of integers. But the dataframe also has a `dbn` column, which holds each school's District Borough Number, a unique identifier. Let's make the DBN the index.

In [21]:
high_school_df.index = high_school_df['dbn']

If we look at our dataframe again, though, we can see that the `dbn` information is now listed twice: once as the index and once as a regular column.

In [23]:
high_school_df[:3]

| dbn | dbn | name | num_test_takers | reading_avg | math_avg | writing_score | boro | total_students | graduation_rate | attendance_rate | college_career_rate |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 01M292 | 01M292 | HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES | 29.0 | 355.0 | 404.0 | 363.0 | M | 171 | 0.66 | 0.87 | 0.36 |
| 01M448 | 01M448 | UNIVERSITY NEIGHBORHOOD HIGH SCHOOL | 91.0 | 383.0 | 423.0 | 366.0 | M | 465 | 0.9 | 0.93 | 0.7 |
| 01M450 | 01M450 | EAST SIDE COMMUNITY SCHOOL | 70.0 | 377.0 | 402.0 | 370.0 | M | 683 | 0.92 | 0.94 | 0.77 |


Let's remove the redundant `dbn` column from the dataframe, and assign the new dataframe to `hs_df`.

In [29]:
hs_df = high_school_df.drop(columns=['dbn'])

In [31]:
'dbn' in hs_df.columns

False
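
Assigning the index and then dropping the column takes two steps. As an alternative sketch, not the approach used above, `set_index` does both at once because it drops the column by default. The data and the `demo_df` and `indexed_df` names are made up for illustration.

```python
import pandas as pd

# Hypothetical two-row dataframe for illustration.
demo_df = pd.DataFrame({
    'dbn': ['01M292', '01M448'],
    'name': ['SCHOOL A', 'SCHOOL B'],
})

# set_index returns a new dataframe and removes 'dbn' from the columns.
indexed_df = demo_df.set_index('dbn')

print('dbn' in indexed_df.columns)       # False
print(indexed_df.index.name)             # dbn
print(indexed_df.loc['01M448', 'name'])  # SCHOOL B
```

The original `demo_df` is left unchanged, so `dbn` is still available there as a regular column.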

### Summary