# Pandas 101

This checkpoint covers many of the basic tasks you might need to do with Pandas. At the end of an hour, commit and push what you have (remember, you can always return to this notebook later for practice).

In [21]:
# Run this import cell without changes

#data manipulation
import pandas as pd

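#dataset
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this import requires an older scikit-learn version
from sklearn.datasets import load_boston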

## Loading in the Boston Housing Dataset

In [2]:
# Run this cell without changes
boston = load_boston()

The variable `boston` is now a dictionary-like object (a scikit-learn `Bunch`) whose key-value pairs hold different aspects of the Boston Housing dataset.

#### What are the keys to `boston`?  

In [32]:
#__SOLUTION__
boston.keys()

dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])

#### Use the `print` function to print out the metadata for the dataset contained in the key `DESCR`

In [25]:
#__SOLUTION__

print(boston['DESCR'])

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pu

#### Create a dataframe named "df_boston" with data contained in the key `data`.  Make the column names of `df_boston` the values from the key `feature_names`

In [31]:
#__SOLUTION__
    
df_boston = pd.DataFrame(boston['data'], columns=boston['feature_names'])
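
The same constructor pattern works for any 2-D array paired with a list of column names. Here is a minimal sketch on made-up values, where `toy_data` and `toy_columns` are hypothetical stand-ins for `boston['data']` and `boston['feature_names']`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for boston['data'] and boston['feature_names']
toy_data = np.array([[0.1, 6.5], [0.2, 5.9], [0.3, 7.1]])
toy_columns = ['CRIM', 'RM']

# Each row of the array becomes a row of the dataframe;
# the i-th name in the list labels the i-th column
toy_df = pd.DataFrame(toy_data, columns=toy_columns)
print(toy_df.shape)  # (3, 2)
```

The number of names must match the number of columns in the array, otherwise pandas raises an error.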

The key `target` contains the median value of a house.  

#### Add a column named "MEDV" to your dataframe which contains the median value of a house

In [37]:
#__SOLUTION__

df_boston['MEDV'] = boston['target']

## Data Exploration

#### Show the first 5 rows of the dataframe with the `head` method

In [19]:
#__SOLUTION__
df_boston.head()

#### Show the summary statistics of all columns with the `describe` method

In [18]:
#__SOLUTION__
df_boston.describe()

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 |
| mean | 3.613524 | 11.363636 | 11.136779 | 0.06917 | 0.554695 | 6.284634 | 68.574901 | 3.795043 | 9.549407 | 408.237154 | 18.455534 | 356.674032 | 12.653063 | 22.532806 |
| std | 8.601545 | 23.322453 | 6.860353 | 0.253994 | 0.115878 | 0.702617 | 28.148861 | 2.10571 | 8.707259 | 168.537116 | 2.164946 | 91.294864 | 7.141062 | 9.197104 |
| min | 0.00632 | 0.0 | 0.46 | 0.0 | 0.385 | 3.561 | 2.9 | 1.1296 | 1.0 | 187.0 | 12.6 | 0.32 | 1.73 | 5.0 |
| 25% | 0.082045 | 0.0 | 5.19 | 0.0 | 0.449 | 5.8855 | 45.025 | 2.100175 | 4.0 | 279.0 | 17.4 | 375.3775 | 6.95 | 17.025 |
| 50% | 0.25651 | 0.0 | 9.69 | 0.0 | 0.538 | 6.2085 | 77.5 | 3.20745 | 5.0 | 330.0 | 19.05 | 391.44 | 11.36 | 21.2 |
| 75% | 3.677083 | 12.5 | 18.1 | 0.0 | 0.624 | 6.6235 | 94.075 | 5.188425 | 24.0 | 666.0 | 20.2 | 396.225 | 16.955 | 25.0 |
| max | 88.9762 | 100.0 | 27.74 | 1.0 | 0.871 | 8.78 | 100.0 | 12.1265 | 24.0 | 711.0 | 22.0 | 396.9 | 37.97 | 50.0 |


#### Check the datatypes of all columns, and see how many nulls are in each column, using the `info` method

In [26]:
#__SOLUTION__

df_boston.info()
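
`info` reports the non-null count for each column, so the number of nulls is the total row count minus that figure. As a sketch of a more direct check, `isna().sum()` counts the nulls per column. The dataframe `toy_df` below is made up for illustration and is not part of the Boston data:

```python
import numpy as np
import pandas as pd

# Made-up dataframe with one missing value in AGE
toy_df = pd.DataFrame({
    'AGE': [65.2, np.nan, 61.1],
    'RM': [6.5, 6.4, 7.2],
})

# isna() marks each missing cell as True; summing counts them per column
null_counts = toy_df.isna().sum()
print(null_counts)
```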

## Data Selection

#### Select all values from the column that contains the weighted distances to five Boston employment centres

*Hint: you printed a description of what each variable contains in a cell above.*

In [33]:
#__SOLUTION__
df_boston['DIS']

0      4.0900
1      4.9671
2      4.9671
3      6.0622
4      6.0622
        ...  
501    2.4786
502    2.2875
503    2.1675
504    2.3889
505    2.5050
Name: DIS, Length: 506, dtype: float64

#### Select rows 10-20 from the AGE, NOX, and MEDV columns

In [39]:
#__SOLUTION__
df_boston.loc[10:20, ['AGE', 'NOX', 'MEDV']]

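| | AGE | NOX | MEDV |
|---|---|---|---|
| 10 | 94.3 | 0.524 | 15.0 |
| 11 | 82.9 | 0.524 | 18.9 |
| 12 | 39.0 | 0.524 | 21.7 |
| 13 | 61.8 | 0.538 | 20.4 |
| 14 | 84.5 | 0.538 | 18.2 |
| 15 | 56.5 | 0.538 | 19.9 |
| 16 | 29.3 | 0.538 | 23.1 |
| 17 | 81.7 | 0.538 | 17.5 |
| 18 | 36.6 | 0.538 | 20.2 |
| 19 | 69.5 | 0.538 | 18.2 |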
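
When slicing rows, keep in mind that `.loc` slices by label and includes both endpoints, while `.iloc` slices by position and excludes the stop value, like ordinary Python slicing. A small sketch on a made-up dataframe (`toy_df` is hypothetical) shows the difference in row counts:

```python
import pandas as pd

# Made-up dataframe with the default integer index 0..29
toy_df = pd.DataFrame({'AGE': list(range(30)), 'NOX': list(range(30))})

# Label-based: includes both label 10 and label 20 (11 rows)
by_label = toy_df.loc[10:20, ['AGE', 'NOX']]

# Position-based: positions 10 up to but not including 20 (10 rows)
by_position = toy_df.iloc[10:20][['AGE', 'NOX']]

print(len(by_label), len(by_position))  # 11 10
```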


#### Select all rows where NOX is greater than .7 and CRIM is greater than 8

In [40]:
#__SOLUTION__
mask = (
    (df_boston['NOX']>.7) &
    (df_boston['CRIM']>8)
)
df_boston[mask]
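
The parentheses around each comparison matter: `&` binds more tightly than `>`, so leaving them out changes the meaning of the expression. A minimal sketch of the same masking pattern on made-up values (`toy_df` is hypothetical):

```python
import pandas as pd

# Made-up values; only rows 0 and 3 satisfy both conditions
toy_df = pd.DataFrame({
    'NOX': [0.75, 0.72, 0.50, 0.80],
    'CRIM': [9.0, 2.0, 12.0, 8.5],
})

# Each comparison yields a boolean Series; & combines them element-wise
mask = (toy_df['NOX'] > 0.7) & (toy_df['CRIM'] > 8)
selected = toy_df[mask]
print(selected.index.tolist())  # [0, 3]
```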

## Data Manipulation

#### Add a column to the dataframe called "MEDV*TAX" which is the product of MEDV and TAX

In [41]:
#__SOLUTION__
df_boston['MEDV*TAX'] = df_boston['MEDV']*df_boston['TAX']

#### What is the average median value of houses located on the Charles River?

In [44]:
#calculations here




In [49]:
#__SOLUTION__

val = (
    df_boston
    [df_boston['CHAS']==1]
    ['MEDV']
    .mean()
)

# MEDV is recorded in thousands of dollars, so convert to dollars
val = val*1000
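
An alternative to filtering first is to group by `CHAS` and take the mean of `MEDV` for each group, which gives the on-river and off-river averages side by side. This is a sketch on made-up values (`toy_df` is hypothetical), not the actual Boston figures:

```python
import pandas as pd

# Made-up values: CHAS is 1 for tracts on the river, 0 otherwise
toy_df = pd.DataFrame({
    'CHAS': [0, 1, 0, 1],
    'MEDV': [20.0, 30.0, 22.0, 26.0],
})

# Mean MEDV for each value of CHAS, indexed by that value
means_by_chas = toy_df.groupby('CHAS')['MEDV'].mean()

# Pick the on-river group and convert from thousands of dollars to dollars
on_river_dollars = means_by_chas.loc[1] * 1000
print(on_river_dollars)  # 28000.0
```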

#### Write a sentence in markdown that answers the above question

In [48]:
#__SOLUTION__


'''The average median value of houses located along the Charles River is $28,440'''

'The average median value of houses located along the Charles River is $28,440'
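
Rather than typing the figure by hand, the sentence can be built from the computed value with an f-string. In this sketch, `val` is assumed to hold the dollar amount quoted in the sentence above:

```python
# Assumed: the value computed earlier, already converted to dollars
val = 28440.0

# The format spec ",.0f" adds thousands separators and drops the decimals
sentence = (
    "The average median value of houses located along "
    f"the Charles River is ${val:,.0f}"
)
print(sentence)
```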