## 2017 Data Import and Cleaning

#### 1. Read In SAT & ACT Data

Read in the `sat_2017.csv` and `act_2017.csv` files and assign them to appropriately named pandas dataframes.

In [1]:
import pandas as pd

In [2]:
#Code:
!ls ../data

act_2017.csv      combined_2017.csv sat_2017.csv
act_2018.csv      final.csv         sat_2018.csv


In [3]:
#Code:
act = '../data/act_2017.csv'
sat = '../data/sat_2017.csv'

In [4]:
act_df = pd.read_csv(act)
sat_df = pd.read_csv(sat)

#### 2. Display Data

Print the first 10 rows of each dataframe to your Jupyter notebook.

In [5]:
act_df.head(10)

Unnamed: 0,State,Participation,English,Math,Reading,Science,Composite
0,National,60%,20.3,20.7,21.4,21.0,21.0
1,Alabama,100%,18.9,18.4,19.7,19.4,19.2
2,Alaska,65%,18.7,19.8,20.4,19.9,19.8
3,Arizona,62%,18.6,19.8,20.1,19.8,19.7
4,Arkansas,100%,18.9,19.0,19.7,19.5,19.4
5,California,31%,22.5,22.7,23.1,22.2,22.8
6,Colorado,100%,20.1,20.3,21.2,20.9,20.8
7,Connecticut,31%,25.5,24.6,25.6,24.6,25.2
8,Delaware,18%,24.1,23.4,24.8,23.6,24.1
9,District of Columbia,32%,24.4,23.5,24.9,23.5,24.2


In [6]:
sat_df.head(10)

Unnamed: 0,State,Participation,Evidence-Based Reading and Writing,Math,Total
0,Alabama,5%,593,572,1165
1,Alaska,38%,547,533,1080
2,Arizona,30%,563,553,1116
3,Arkansas,3%,614,594,1208
4,California,53%,531,524,1055
5,Colorado,11%,606,595,1201
6,Connecticut,100%,530,512,1041
7,Delaware,100%,503,492,996
8,District of Columbia,100%,482,468,950
9,Florida,83%,520,497,1017


#### 3. Verbally Describe Data

Take your time looking through the data and thoroughly describe the data in the markdown cell below. 

In [7]:
act_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52 entries, 0 to 51
Data columns (total 7 columns):
State            52 non-null object
Participation    52 non-null object
English          52 non-null float64
Math             52 non-null float64
Reading          52 non-null float64
Science          52 non-null float64
Composite        52 non-null object
dtypes: float64(4), object(3)
memory usage: 2.9+ KB


In [8]:
sat_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51 entries, 0 to 50
Data columns (total 5 columns):
State                                 51 non-null object
Participation                         51 non-null object
Evidence-Based Reading and Writing    51 non-null int64
Math                                  51 non-null int64
Total                                 51 non-null int64
dtypes: int64(3), object(2)
memory usage: 2.1+ KB


Answer:

act_df includes participation rates and ACT English, Math, Reading, Science, and Composite scores for each of the 50 states and Washington DC, as well as aggregate National data.

sat_df includes participation rates and SAT "Evidence-Based Reading and Writing", Math, and Total scores for each of the 50 states and Washington DC. It does not include aggregate National data.

#### 4a. Does the data look complete? 

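Answer: Yes. Neither dataframe has null values: every column of act_df has 52 non-null entries and every column of sat_df has 51.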
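
One way to back up this answer is to count missing values per column and check for duplicated states. The sketch below uses a small hypothetical frame (`sample`, three rows copied from the `sat_df` output above) so that it runs without the CSV files; the same calls should apply to `sat_df` and `act_df`.

```python
import pandas as pd

# Hypothetical stand-in for sat_df (three rows copied from the output above).
sample = pd.DataFrame({
    "State": ["Alabama", "Alaska", "Arizona"],
    "Participation": ["5%", "38%", "30%"],
    "Math": [572, 533, 553],
})

# Count missing values in each column; all zeros means no nulls.
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# Check whether any state appears more than once.
has_duplicate_states = sample["State"].duplicated().any()
print(has_duplicate_states)
```

A clean null count only shows that no cells are empty. It says nothing about whether the values themselves are plausible, which is the subject of 4b.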

#### 4b. Are there any obvious issues with the observations?

**What is the minimum *possible* value for each test/subtest? What is the maximum *possible* value?**

Consider comparing any questionable values to the sources of your data:
- [SAT](https://blog.collegevine.com/here-are-the-average-sat-scores-by-state/)
- [ACT](https://blog.prepscholar.com/act-scores-by-state-averages-highs-and-lows)

In [9]:
sat_df.describe()

Unnamed: 0,Evidence-Based Reading and Writing,Math,Total
count,51.0,51.0,51.0
mean,569.117647,547.627451,1126.098039
std,45.666901,84.909119,92.494812
min,482.0,52.0,950.0
25%,533.5,522.0,1055.5
50%,559.0,548.0,1107.0
75%,613.0,599.0,1212.0
max,644.0,651.0,1295.0


In [10]:
act_df.describe()

Unnamed: 0,English,Math,Reading,Science
count,52.0,52.0,52.0,52.0
mean,20.919231,21.173077,22.001923,21.040385
std,2.332132,1.963602,2.048672,3.151113
min,16.3,18.0,18.1,2.3
25%,19.0,19.4,20.475,19.9
50%,20.55,20.9,21.7,21.15
75%,23.3,23.1,24.125,22.525
max,25.5,25.3,26.0,24.9


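Answer: Each SAT section is scored from 200 to 800 (400 to 1600 in total), and each ACT subject is scored from 1 to 36. The minimum SAT Math score, 52, is therefore impossible, since the lowest score one can get is 200. The minimum ACT Science score, 2.3, is within the valid range but is implausibly low compared with every other state average.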
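
The range check described above can be sketched in code. This is a minimal example on a hypothetical frame (`sample`, with placeholder state names "A", "B", "C"); `SAT_SECTION_RANGE` is a name introduced here for illustration.

```python
import pandas as pd

# SAT sections are scored from 200 to 800, so a state average outside
# that range cannot be correct.
SAT_SECTION_RANGE = (200, 800)

# Hypothetical stand-in for sat_df; the 52 mirrors the suspicious minimum above.
sample = pd.DataFrame({
    "State": ["A", "B", "C"],
    "Math": [572, 52, 553],
})

low, high = SAT_SECTION_RANGE
# between() is inclusive on both ends; ~ inverts the mask to keep the violations.
out_of_range = sample[~sample["Math"].between(low, high)]
print(out_of_range)
```

A check like this would not flag the ACT Science value of 2.3, because it lies inside the 1 to 36 range. Values of that kind are easier to spot by comparing them with the column's quartiles in `describe()`.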

#### 4c. Fix any errors you identified

**The data is available** so there's no need to guess or calculate anything. If you didn't find any errors, continue to the next step.

In [11]:
sat_df.iloc[20, [3]]

Math    52
Name: 20, dtype: object

In [12]:
# Overwrite the mistyped Math score in row 20 (column position 3) with the corrected value
sat_df.iloc[20, [3]] = 524

In [13]:
sat_df.iloc[20, [3]]

Math    524
Name: 20, dtype: object

In [14]:
sat_df.describe()

Unnamed: 0,Evidence-Based Reading and Writing,Math,Total
count,51.0,51.0,51.0
mean,569.117647,556.882353,1126.098039
std,45.666901,47.121395,92.494812
min,482.0,468.0,950.0
25%,533.5,523.5,1055.5
50%,559.0,548.0,1107.0
75%,613.0,599.0,1212.0
max,644.0,651.0,1295.0


In [15]:
act_df['Science'][21]

2.3

In [16]:
# Assign in a single .loc step; chained indexing (act_df['Science'][21] = ...) raises SettingWithCopyWarning
act_df.loc[21, 'Science'] = 23.2


In [17]:
act_df['Science'][21]

23.2

In [18]:
act_df.describe()

Unnamed: 0,English,Math,Reading,Science
count,52.0,52.0,52.0,52.0
mean,20.919231,21.173077,22.001923,21.442308
std,2.332132,1.963602,2.048672,1.723351
min,16.3,18.0,18.1,18.2
25%,19.0,19.4,20.475,19.975
50%,20.55,20.9,21.7,21.3
75%,23.3,23.1,24.125,23.2
max,25.5,25.3,26.0,24.9
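
Chained indexing such as `act_df['Science'][21] = 23.2` can trigger pandas' `SettingWithCopyWarning`, because the assignment may land on a temporary copy. A minimal sketch of the single-step `.loc` alternative follows, using a small hypothetical frame (`sample`) with placeholder state names.

```python
import pandas as pd

# Hypothetical stand-in for act_df with one mistyped Science value.
sample = pd.DataFrame({
    "State": ["A", "B", "C"],
    "Science": [19.4, 2.3, 19.8],
})

# Single-step, label-based assignment: row label 1, column "Science".
# This writes directly to the original frame.
sample.loc[1, "Science"] = 23.2

# Selecting the row by a condition avoids hard-coding the row number.
sample.loc[sample["State"] == "C", "Science"] = 19.9
print(sample)
```

Selecting by state name rather than by position keeps the fix valid even if rows are later dropped or reordered.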


#### 5. What are your data types? 
Display the data types of each feature. 

In [19]:
sat_df.dtypes

State                                 object
Participation                         object
Evidence-Based Reading and Writing     int64
Math                                   int64
Total                                  int64
dtype: object

In [20]:
act_df.dtypes

State             object
Participation     object
English          float64
Math             float64
Reading          float64
Science          float64
Composite         object
dtype: object

What did you learn?
- Do any of them seem odd?  
- Which ones are not as they should be?  

Answer:
- For ACT, the 'Composite' column data type is 'object', but it should be 'float64'.
- For both ACT and SAT, it might be helpful to change 'Participation' from 'object' type to an integer type.

#### 6. Fix Incorrect Data Types
Based on what you discovered above, use appropriate methods to re-type incorrectly typed data.
- Define a function that will allow you to convert participation rates to an appropriate numeric type. Use `map` or `apply` to change these columns in each dataframe.

In [21]:
print(sat_df.dtypes)
print('')
print(act_df.dtypes)

State                                 object
Participation                         object
Evidence-Based Reading and Writing     int64
Math                                   int64
Total                                  int64
dtype: object

State             object
Participation     object
English          float64
Math             float64
Reading          float64
Science          float64
Composite         object
dtype: object


In [22]:
def participation_to_numeric(dataframe, column='Participation'):
    """Strip the '%' sign from `column` and convert its values to int, in place."""
    dataframe[column] = dataframe[column].map(lambda x: int(x.replace('%', '')))

In [23]:
participation_to_numeric(sat_df)
participation_to_numeric(act_df)

In [24]:
print(sat_df.dtypes)
print('')
print(act_df.dtypes)

State                                 object
Participation                          int64
Evidence-Based Reading and Writing     int64
Math                                   int64
Total                                  int64
dtype: object

State             object
Participation      int64
English          float64
Math             float64
Reading          float64
Science          float64
Composite         object
dtype: object
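
The exercise asks for `map` or `apply`, which the function above uses. As an alternative sketch, the same conversion can be written with the vectorized `.str` accessor. `percent_to_int` is a hypothetical helper name, and it returns a new series instead of modifying the dataframe in place.

```python
import pandas as pd

def percent_to_int(series):
    """Return a copy of `series` with a trailing '%' removed and values cast to int."""
    return series.str.rstrip("%").astype(int)

# Hypothetical stand-in for the Participation column.
sample = pd.DataFrame({"Participation": ["5%", "38%", "100%"]})
sample["Participation"] = percent_to_int(sample["Participation"])
print(sample.dtypes)
```

Returning a new series makes the function safe to call twice by mistake on the original strings, whereas the in-place version would fail on a column that has already been converted to integers.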


In [25]:
#act_df['Composite'] = pd.to_numeric(act_df['Composite'])

- Fix any individual values preventing other columns from being the appropriate type.

In [26]:
act_df.Composite[51]

'20.2x'

In [27]:
# Assign in a single .loc step; chained indexing (act_df.Composite[51] = ...) raises SettingWithCopyWarning
act_df.loc[51, 'Composite'] = '20.2'


In [28]:
act_df.Composite[51]

'20.2'

- Finish your data modifications by making sure the columns are now typed appropriately.

In [29]:
act_df['Composite'] = pd.to_numeric(act_df['Composite'])

In [30]:
act_df.dtypes

State             object
Participation      int64
English          float64
Math             float64
Reading          float64
Science          float64
Composite        float64
dtype: object

- Display the data types again to confirm they are correct.

In [31]:
#Code:
sat_df.dtypes

State                                 object
Participation                          int64
Evidence-Based Reading and Writing     int64
Math                                   int64
Total                                  int64
dtype: object

In [32]:
act_df.dtypes

State             object
Participation      int64
English          float64
Math             float64
Reading          float64
Science          float64
Composite        float64
dtype: object
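
The malformed value `'20.2x'` was located by row number above. One way to find such values without knowing where they are is to parse the column with `errors="coerce"` and look at what failed. This is a sketch on a hypothetical series (`composite`), not the notebook's actual data.

```python
import pandas as pd

# Hypothetical stand-in for the ACT Composite column, with one malformed entry.
composite = pd.Series(["19.2", "19.8", "20.2x"])

# errors="coerce" turns unparseable strings into NaN instead of raising.
parsed = pd.to_numeric(composite, errors="coerce")

# Entries that were present before parsing but are NaN afterwards are the bad cells.
bad_values = composite[parsed.isna() & composite.notna()]
print(bad_values)
```

Inspecting `bad_values` before fixing anything shows every offending cell at once, so each one can be corrected against the source data.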

#### 7. Rename Columns
Change the names of the columns to more expressive names so that you can tell the difference between the SAT columns and the ACT columns. Your solution should map all column names being changed at once (no repeated singular name-changes). **We will be combining these data with some of the data from 2018, so you should name columns in an appropriate way**.

**Guidelines**:
- Column names should be all lowercase (you will thank yourself when you start pushing data to SQL later in the course)
- Column names should not contain spaces (underscores will suffice; this allows for using the `df.column_name` method to access columns in addition to `df['column_name']`).
- Column names should be unique and informative (the only feature that we actually share between dataframes is the state).

In [33]:
act_df.columns = act_df.columns.map(lambda x: x.lower())
act_df.columns = act_df.columns.map(lambda x: x if x == 'state' else '2017_act_' + x)
act_df.columns

Index(['state', '2017_act_participation', '2017_act_english', '2017_act_math',
       '2017_act_reading', '2017_act_science', '2017_act_composite'],
      dtype='object')

In [34]:
# act_df.columns = map(str.lower, act_df.columns)
# act_df = act_df.add_prefix('2017_act_')
# act_df.columns

In [35]:
sat_df.columns = sat_df.columns.map(lambda x: x.lower())
sat_df.columns = sat_df.columns.map(lambda x: x if x == 'state' else '2017_sat_' + x)
sat_df.columns

Index(['state', '2017_sat_participation',
       '2017_sat_evidence-based reading and writing', '2017_sat_math',
       '2017_sat_total'],
      dtype='object')

In [36]:
sat_df.rename(columns={'2017_sat_evidence-based reading and writing':'2017_sat_evidence-based_reading_and_writing' }, inplace=True)
sat_df.columns

Index(['state', '2017_sat_participation',
       '2017_sat_evidence-based_reading_and_writing', '2017_sat_math',
       '2017_sat_total'],
      dtype='object')
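
The three renaming steps above (lowercase, prefix, replace spaces) can also be folded into one mapping that is applied in a single `rename` call. The sketch below assumes a hypothetical helper, `build_rename_map`, and an empty frame that only carries the original SAT column names.

```python
import pandas as pd

def build_rename_map(columns, prefix, shared=("state",)):
    """Map each column to a lowercase, underscore-separated, prefixed name.

    Columns listed in `shared` (after cleaning) keep their name without a prefix.
    """
    mapping = {}
    for name in columns:
        clean = name.lower().replace(" ", "_")
        mapping[name] = clean if clean in shared else prefix + clean
    return mapping

# Hypothetical empty frame with the original SAT column names.
sample = pd.DataFrame(columns=["State", "Participation",
                               "Evidence-Based Reading and Writing", "Math", "Total"])
sample = sample.rename(columns=build_rename_map(sample.columns, "2017_sat_"))
print(list(sample.columns))
```

Names that begin with a digit or contain a hyphen cannot be used with attribute access (`df.2017_sat_math` is not valid Python), so a prefix such as `sat_2017_` would be needed if `df.column_name` access matters.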

#### 8. Create a data dictionary

Now that we've fixed our data, and given it appropriate names, let's create a [data dictionary](http://library.ucmerced.edu/node/10249). 

A data dictionary provides a quick overview of features/variables/columns, alongside data types and descriptions. The more descriptive you can be, the more useful this document is.

Example of a Fictional Data Dictionary Entry: 

|Feature|Type|Dataset|Description|
|---|---|---|---|
|**county_pop**|*integer*|2010 census|The population of the county (units in thousands, where 2.5 represents 2500 people).| 
|**per_poverty**|*float*|2010 census|The percent of the county over the age of 18 living below 200% of the official US poverty rate (units: percent to two decimal places, so 98.10 means 98.1%)|

[Here's a quick link to a short guide for formatting markdown in Jupyter notebooks](https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Working%20With%20Markdown%20Cells.html).

Provided is the skeleton for formatting a markdown table, with column headers that will help you create a data dictionary to quickly summarize your data, as well as some examples. **This would be a great thing to copy and paste into your custom README for this project.**

|Feature|Type|Dataset|Description|
|---|---|---|---|
|**state**|object|ACT/SAT|The US state to which the data applies| 
|**2017_act_participation**|int|ACT|The percentage of students in the state who take the ACT| 
|**2017_act_english**|float|ACT|The average ACT English score for the state (scored between 1 and 36)| 
|**2017_act_math**|float|ACT|The average ACT Math score for the state (scored between 1 and 36)| 
|**2017_act_reading**|float|ACT|The average ACT Reading score for the state (scored between 1 and 36)| 
|**2017_act_science**|float|ACT|The average ACT Science score for the state (scored between 1 and 36)|
|**2017_act_composite**|float|ACT|The average ACT Composite score for the state; the Composite score is the average of the four subject scores (English, Math, Reading, Science)| 
|**2017_sat_participation**|int|SAT|The percentage of students in the state who take the SAT| 
|**2017_sat_evidence-based_reading_and_writing**|int|SAT|The average SAT Evidence-Based Reading and Writing score for the state (scored between 200 and 800)| 
|**2017_sat_math**|int|SAT|The average SAT Math score for the state (scored between 200 and 800)| 
|**2017_sat_total**|int|SAT|The average total score for the state; total score is the sum of evidence-based_reading_and_writing and math, so it can range from 400 to 1600| 
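
The Feature, Type, and Dataset columns of a table like this can be generated from the dataframe itself, leaving only the descriptions to write by hand. This is a rough sketch on a hypothetical frame (`sample`); the rule for the Dataset column assumes the `_act_` / `_sat_` naming used in this notebook.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned data.
sample = pd.DataFrame({
    "state": ["A"],
    "2017_act_participation": [100],
    "2017_act_english": [18.9],
})

# Emit one markdown table row per column; the Description cell is left empty.
lines = ["|Feature|Type|Dataset|Description|", "|---|---|---|---|"]
for name, dtype in sample.dtypes.items():
    dataset = "ACT" if "_act_" in name else "SAT" if "_sat_" in name else "ACT/SAT"
    lines.append(f"|**{name}**|{dtype}|{dataset}||")
table = "\n".join(lines)
print(table)
```

Generating the skeleton from `dtypes` keeps the Type column in sync with the data if a column is re-typed later.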


#### 9. Drop unnecessary rows

One of our dataframes contains an extra row. Identify and remove this from the dataframe.

In [37]:
#code
act_df.loc[[0], :]

Unnamed: 0,state,2017_act_participation,2017_act_english,2017_act_math,2017_act_reading,2017_act_science,2017_act_composite
0,National,60,20.3,20.7,21.4,21.0,21.0


In [38]:
act_df.drop(0, inplace=True)

In [39]:
act_df.head()

Unnamed: 0,state,2017_act_participation,2017_act_english,2017_act_math,2017_act_reading,2017_act_science,2017_act_composite
1,Alabama,100,18.9,18.4,19.7,19.4,19.2
2,Alaska,65,18.7,19.8,20.4,19.9,19.8
3,Arizona,62,18.6,19.8,20.1,19.8,19.7
4,Arkansas,100,18.9,19.0,19.7,19.5,19.4
5,California,31,22.5,22.7,23.1,22.2,22.8
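
The extra row can also be identified programmatically by comparing the `state` values of the two dataframes, instead of reading it off the first row. A minimal sketch, using two hypothetical frames (`act_sample`, `sat_sample`) with a few rows copied from the outputs above:

```python
import pandas as pd

act_sample = pd.DataFrame({
    "state": ["National", "Alabama", "Alaska"],
    "2017_act_composite": [21.0, 19.2, 19.8],
})
sat_sample = pd.DataFrame({
    "state": ["Alabama", "Alaska"],
    "2017_sat_total": [1165, 1080],
})

# States present in the ACT frame but missing from the SAT frame.
extra_states = set(act_sample["state"]) - set(sat_sample["state"])
print(extra_states)

# Keep only rows whose state is not in the extra set, then renumber from 0.
states_only = act_sample[~act_sample["state"].isin(extra_states)].reset_index(drop=True)
print(states_only)
```

`act_df.drop(0, inplace=True)` leaves the index starting at 1, as the output above shows. The `reset_index(drop=True)` step is optional and only matters if later code relies on positions matching labels.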


#### 10. Merge Dataframes

Join the 2017 ACT and SAT dataframes using the state in each dataframe as the key. Assign this to a new variable.

In [40]:
#Code:
combined_2017 = act_df.merge(sat_df, on = 'state')
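
`merge` defaults to an inner join, which silently drops any state whose name does not match exactly in both dataframes. The sketch below shows one way to check for that, using two hypothetical frames (`act`, `sat`); `how`, `indicator`, and `validate` are existing parameters of `DataFrame.merge`.

```python
import pandas as pd

act = pd.DataFrame({"state": ["Alabama", "Alaska"], "2017_act_composite": [19.2, 19.8]})
sat = pd.DataFrame({"state": ["Alabama", "Alaska"], "2017_sat_total": [1165, 1080]})

# how="outer" keeps unmatched states, indicator=True adds a "_merge" column
# naming the source of each row, and validate="one_to_one" raises an error
# if a state is duplicated in either frame.
checked = act.merge(sat, on="state", how="outer", indicator=True, validate="one_to_one")

# Rows that did not find a partner in the other frame.
unmatched = checked[checked["_merge"] != "both"]
print(len(unmatched))
```

If `unmatched` is empty, the inner join used above loses nothing and the `_merge` column can be discarded.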

#### 11. Save your cleaned, merged dataframe

Use a relative path to save out your data as `combined_2017.csv`.

In [41]:
#code
combined_2017.to_csv('../data/combined_2017.csv')
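
By default `to_csv` also writes the dataframe's index, which comes back as an `Unnamed: 0` column when the file is read again. The sketch below illustrates the difference with `index=False`, writing to in-memory buffers (`io.StringIO`) instead of `../data/` so that it has no side effects; `sample` is a hypothetical frame.

```python
import io
import pandas as pd

sample = pd.DataFrame({"state": ["Alabama", "Alaska"], "2017_sat_total": [1165, 1080]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)
print(cols_with_index)

# index=False leaves the index out of the file.
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
reloaded = pd.read_csv(without_index)
print(list(reloaded.columns))
```

Whether to keep the index is a design choice. Here it carries no information beyond row order, so dropping it keeps the saved file limited to the columns in the data dictionary.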