In [1]:
import numpy as np
import pandas as pd

## Selecting Data from `DataFrame` Objects

As with `Series` objects, you can interact with `DataFrame` objects in ways that sometimes resemble a dictionary and other times a NumPy array.

In [2]:
college_scorecard = pd.read_csv(
    './data/college-scorecard-data-scrubbed.csv', 
    encoding='latin-1',
    index_col = 'institution_name')
college_scorecard.head()

institution_name,UNITID,OPEID,OPEID6,city,state,url,predominant_degree_code,predominant_degree_desc,institutional_owner_code,institutional_owner_desc,...,pell_grant_receipents,full_time_retention_rate_4_year,full_time_retention_rate_less_than_4_year,part_time_rentention_rate_4_year,part_time_rentention_rate_less_than_4_year,students_with_federal_loans,median_student_earnings,median_student_debt,less_than_4_year_school_completion_rate,4_year_school_completion_rate
Alaska Bible College,102580,884300,8843,Palmer,AK,www.akbible.edu/,3,Bachelors,2,PrivateNonProfit,...,0.3571,0.3333,,,,0.2857,,PrivacySuppressed,,
Alaska Career College,103501,2541000,25410,Anchorage,AK,www.alaskacareercollege.edu,1,Certificate,3,PrivateForProfit,...,0.7078,,0.7941,,,0.786,28700.0,8994,0.707589494,
Alaska Christian College,442523,4138600,41386,Soldotna,AK,www.alaskacc.edu,1,Certificate,2,PrivateNonProfit,...,0.8868,,0.4737,,1.0,0.6792,,PrivacySuppressed,0.0,
Alaska Pacific University,102669,106100,1061,Anchorage,AK,www.alaskapacific.edu,3,Bachelors,2,PrivateNonProfit,...,0.3152,0.7742,,1.0,,0.5297,47000.0,23250,,0.514833663
AVTEC-Alaska's Institute of Technology,102711,3160300,31603,Seward,AK,www.avtec.edu/,1,Certificate,1,Public,...,0.0737,,1.0,,1.0,0.0664,33500.0,PrivacySuppressed,0.846055789,


### Masking

Masking operations also return rows from a `DataFrame`, but **the mask is built from a comparison on one of the columns (a `Series`)**. The comparison produces a Boolean `Series`, and indexing the `DataFrame` with it keeps only the rows where the value is `True`. A demonstration makes this clearer:

In [None]:
college_scorecard['state'] == 'AK'

In [None]:
# Return all rows where the 'state' Series has a value of 'AK'
college_scorecard[ college_scorecard['state'] == 'AK']
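
The two cells above can be read as two separate steps. The sketch below repeats them on a small made-up table (the college names and values are hypothetical, not taken from the scorecard file), so the intermediate Boolean `Series` is easy to inspect.

```python
import pandas as pd

# Hypothetical toy data standing in for the college scorecard
toy = pd.DataFrame(
    {
        "state": ["AK", "IN", "AK", "OH"],
        "predominant_degree_desc": ["Bachelors", "Bachelors", "Certificate", "Associates"],
    },
    index=["College A", "College B", "College C", "College D"],
)

# Step 1: the comparison yields a Boolean Series aligned with the index
mask = toy["state"] == "AK"
print(mask.tolist())          # [True, False, True, False]

# Step 2: indexing with the mask keeps only the rows marked True
alaska = toy[mask]
print(alaska.index.tolist())  # ['College A', 'College C']
```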

In [None]:
# Which colleges in IN offer Bachelors degrees?
# Notice the parentheses around each comparison: & binds more tightly than ==
# Also notice that the result is assigned to a variable so it can be used later
colleges_IN_Bachelors = college_scorecard[
    (college_scorecard['state'] == 'IN') & 
    (college_scorecard['predominant_degree_desc'] == 'Bachelors')
    ]

**NOTE**: The right-hand side of the assignment is split across several lines to make the code easier to read. Python allows line breaks inside brackets and parentheses.
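
Another way to keep a combined mask readable is to give each condition its own name first. This is only a sketch on hypothetical toy data; the variable names `in_indiana` and `offers_bachelors` are illustrative and do not appear in the notebook.

```python
import pandas as pd

# Hypothetical toy data with the same two columns used in the mask above
toy = pd.DataFrame(
    {
        "state": ["IN", "IN", "AK"],
        "predominant_degree_desc": ["Bachelors", "Certificate", "Bachelors"],
    }
)

# Each condition is a Boolean Series with one entry per row
in_indiana = toy["state"] == "IN"
offers_bachelors = toy["predominant_degree_desc"] == "Bachelors"

# & combines the two masks element by element (logical AND)
both = toy[in_indiana & offers_bachelors]
print(len(both))  # 1
```

Because each comparison is already evaluated and stored, no extra parentheses are needed around the named masks.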

In [None]:
colleges_IN_Bachelors

In [None]:
# how many colleges met the criteria?
print(len(colleges_IN_Bachelors))
colleges_IN_Bachelors.shape[0]


### Selecting Multiple Columns of a `DataFrame`

In [None]:
two_columns = college_scorecard[ ['state', 'predominant_degree_desc'] ]
two_columns.head()

**NOTE**: In the double square brackets `[[ ]]`, the outer pair is the indexing operator applied to the `DataFrame`, and the inner pair is a Python list naming the columns you want to select.
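
To see what each pair of brackets does, the sketch below compares a single label with a list of labels. The table is hypothetical toy data; only the column names are borrowed from the scorecard file.

```python
import pandas as pd

# Hypothetical toy data; values are made up for illustration
toy = pd.DataFrame(
    {"state": ["AK", "IN"], "city": ["Palmer", "Gary"], "UNITID": [1, 2]}
)

# A single label returns one column as a Series
one = toy["state"]
print(type(one).__name__)       # Series

# A list of labels returns a DataFrame, even if the list has one entry
two = toy[["state", "city"]]
as_frame = toy[["state"]]
print(type(two).__name__)       # DataFrame
print(type(as_frame).__name__)  # DataFrame

# The inner list can be built separately and passed in by name
wanted = ["state", "city"]
same = toy[wanted]
```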

# Handling Missing Data

In [3]:
none_val = None
print(type(none_val))
print(none_val is None)

<class 'NoneType'>
True


In [4]:
none_val*5

TypeError: unsupported operand type(s) for *: 'NoneType' and 'int'

In [7]:
# What happens when the array includes None?
vals1 = np.array([1, 3, None, 4])
vals1
vals1*5

TypeError: unsupported operand type(s) for *: 'NoneType' and 'int'

In [8]:
vals1

array([1, 3, None, 4], dtype=object)

In [9]:
np.sum(vals1)

TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

### NaN: Missing numerical data

NaN stands for Not-a-Number. It is a special floating-point value that NumPy uses to mark missing numerical data.

In [10]:
vals1 = np.array([1, np.nan, 3, 4])
vals1*5

array([ 5., nan, 15., 20.])

In [11]:
vals1

array([ 1., nan,  3.,  4.])

In [12]:
vals1.dtype

dtype('float64')

In [13]:
np.sum(vals1)

nan

**The sum of any number and NaN is NaN.**
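
The same propagation applies to other arithmetic and to aggregates such as `max`. A minimal sketch follows; it also shows that NaN does not compare equal to itself, which is why `np.isnan` is used to detect it.

```python
import numpy as np

x = np.nan

# Arithmetic with NaN gives NaN
print(x + 1)        # nan
print(x * 0)        # nan

# NaN is not equal to anything, including itself
print(x == x)       # False

# np.isnan is the reliable way to test for NaN
print(np.isnan(x))  # True

vals = np.array([1, np.nan, 3, 4])
print(np.isnan(vals).sum())  # 1 (number of NaN entries)
print(vals.max())            # nan (aggregates propagate NaN too)
```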

### np.nansum

`np.nansum` adds the elements of the array while treating NaN as zero.

In [14]:
np.nansum(vals1)

8.0

In [15]:
# lots of nan related functions...
np.nanmedian(vals1)


3.0
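
As a side-by-side comparison, the sketch below runs a few plain NumPy aggregates next to their NaN-aware counterparts on the same array. Only functions that exist in NumPy are used (`np.nansum`, `np.nanmean`, `np.nanmax`); the dictionary names are illustrative.

```python
import numpy as np

vals = np.array([1, np.nan, 3, 4])

# Plain aggregates propagate NaN
plain = {"sum": np.sum(vals), "mean": np.mean(vals), "max": np.max(vals)}

# NaN-aware versions ignore the missing entry
nan_aware = {
    "sum": np.nansum(vals),
    "mean": np.nanmean(vals),  # (1 + 3 + 4) / 3
    "max": np.nanmax(vals),
}

for name in plain:
    print(name, plain[name], nan_aware[name])
```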

### NaN and None in pandas

In a numeric `Series`, pandas treats `None` and NaN alike and stores both as NaN.

In [16]:
simple_series = pd.Series([1,np.nan, 2, None])

In [17]:
simple_series

0    1.0
1    NaN
2    2.0
3    NaN
dtype: float64

In [18]:
print("sums up total:  ", simple_series.sum())
print("averages of existing numbers:  ", simple_series.mean())

sums up total:   3.0
averages of existing numbers:   1.5
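
The results above come from the default `skipna=True` behavior of pandas aggregations. As a small sketch, passing `skipna=False` restores the NumPy-style propagation, and `count()` reports how many values are actually present.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, np.nan, 2, None])

# Default: missing values are skipped
print(s.sum())               # 3.0
print(s.mean())              # 1.5

# skipna=False: any missing value makes the result NaN
print(s.sum(skipna=False))   # nan

# count() counts non-missing values; len() counts all entries
print(s.count(), len(s))     # 2 4
```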


## Operating on Null Values

pandas provides the following methods for detecting and handling null values:

| Method        | Description                                          |
|---------------|------------------------------------------------------|
| ``isnull()``  | Generate a Boolean mask indicating missing values    |
| ``notnull()`` | Opposite of ``isnull()``                             |
| ``dropna()``  | Return a filtered version of the data                |
| ``fillna()``  | Return a copy of the data with missing values filled |


In [19]:
simple_data = pd.Series([1,np.nan, 'Hello', None])
simple_data

0        1
1      NaN
2    Hello
3     None
dtype: object

In [20]:
simple_data.isnull()

0    False
1     True
2    False
3     True
dtype: bool

In [21]:
~simple_data.isnull()

0     True
1    False
2     True
3    False
dtype: bool

In [22]:
simple_data[ ~simple_data.isnull() ]

0        1
2    Hello
dtype: object

In [23]:
simple_data[simple_data.notnull()]

0        1
2    Hello
dtype: object

In [24]:
simple_data

0        1
1      NaN
2    Hello
3     None
dtype: object

In [25]:
simple_data.dropna()

0        1
2    Hello
dtype: object

In [26]:
df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,    6]])

df

Unnamed: 0,0,1,2
0,1.0,,2
1,2.0,3.0,5
2,,4.0,6


In [27]:
df.dropna()

Unnamed: 0,0,1,2
1,2.0,3.0,5


In [28]:
df.dropna(axis=1)

Unnamed: 0,2
0,2
1,5
2,6


<div class="alert alert-block alert-info">
<p>
The ``dropna()`` method on a `DataFrame` accepts other optional parameters, such as ``how`` and ``thresh``. **See page 126 of the textbook for more details.** </p>
</div> 
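
A minimal sketch of `how` and `thresh`, using a slightly larger hypothetical frame than the one above so that the three calls give different results:

```python
import numpy as np
import pandas as pd

# Hypothetical data: rows have 2, 3, 2, 0 and 1 non-missing values
demo = pd.DataFrame(
    [
        [1, np.nan, 2],
        [2, 3, 5],
        [np.nan, 4, 6],
        [np.nan, np.nan, np.nan],
        [np.nan, np.nan, 7],
    ]
)

# Default (how='any'): drop a row if any value is missing
print(len(demo.dropna()))           # 1

# how='all': drop a row only if every value is missing
print(len(demo.dropna(how="all")))  # 4

# thresh=2: keep rows with at least 2 non-missing values
print(len(demo.dropna(thresh=2)))   # 3
```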

In [29]:
df

Unnamed: 0,0,1,2
0,1.0,,2
1,2.0,3.0,5
2,,4.0,6


In [30]:
df.fillna(100)

Unnamed: 0,0,1,2
0,1.0,100.0,2
1,2.0,3.0,5
2,100.0,4.0,6


In [31]:
df

Unnamed: 0,0,1,2
0,1.0,,2
1,2.0,3.0,5
2,,4.0,6


In [32]:
df = df.fillna(100)

In [None]:
df

<div class="alert alert-block alert-info">
<p>
The ``fillna()`` method on a `DataFrame` also accepts an optional ``method`` parameter, for example ``method='ffill'`` or ``method='bfill'``. Recent pandas releases deprecate this parameter in favor of the ``ffill()`` and ``bfill()`` methods. **See page 127 of the textbook for more details.** </p>

<p>
**Also read about the ``inplace`` keyword argument. What happens when it is set to `False`, and what happens when it is set to `True`?**
</p>
</div> 
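
The sketch below illustrates forward fill, backward fill, and the effect of `inplace` on the same 3x3 frame used earlier. It uses the `ffill()` and `bfill()` methods rather than the `method=` argument; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

original = pd.DataFrame([[1, np.nan, 2],
                         [2, 3, 5],
                         [np.nan, 4, 6]])

# Forward fill: copy the last valid value downward within each column.
# The NaN in row 0 of column 1 stays, because nothing precedes it.
forward = original.ffill()

# Backward fill: copy the next valid value upward within each column.
# The NaN in row 2 of column 0 stays, because nothing follows it.
backward = original.bfill()

# inplace=False (the default) returns a new object; original is untouched
filled = original.fillna(100)
print(original.isnull().sum().sum())  # 2

# inplace=True modifies the object it is called on
work = original.copy()
work.fillna(100, inplace=True)
print(work.isnull().sum().sum())      # 0
```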

## Working with a Dataset with Missing Values

Marketing dataset: This dataset contains responses from questionnaires that were filled out by shopping mall customers in the San Francisco Bay area. The goal is to predict the annual household income from the other 13 demographic attributes. [Source](http://sci2s.ugr.es/keel/dataset.php?cod=163)

[Data Dictionary](http://sci2s.ugr.es/keel/dataset/data/classification/marketing-names.txt)

In [33]:
mark_data = pd.read_csv('./data/marketing.csv')

In [34]:
mark_data.head()

Unnamed: 0,Sex,MaritalStatus,Age,Education,Occupation,YearsInSf,DualIncome,HouseholdMembers,Under18,HouseholdStatus,TypeOfHome,EthnicClass,Language,Income
0,2,1.0,5,4.0,5.0,5.0,3,3.0,0,1.0,1.0,7.0,,9
1,1,1.0,5,5.0,5.0,5.0,3,5.0,2,1.0,1.0,7.0,1.0,9
2,2,1.0,3,5.0,1.0,5.0,2,3.0,1,2.0,3.0,7.0,1.0,9
3,2,5.0,1,2.0,6.0,5.0,1,4.0,2,3.0,1.0,7.0,1.0,1
4,2,5.0,1,2.0,6.0,3.0,1,4.0,2,3.0,1.0,7.0,1.0,1


In [35]:
mark_data[ ['YearsInSf', 'Language']]

Unnamed: 0,YearsInSf,Language
0,5.0,
1,5.0,1.0
2,5.0,1.0
3,5.0,1.0
4,3.0,1.0
...,...,...
8988,5.0,1.0
8989,5.0,1.0
8990,5.0,1.0
8991,5.0,1.0


In [36]:
mark_data[ ['YearsInSf', 'Language']].isnull()

Unnamed: 0,YearsInSf,Language
0,False,True
1,False,False
2,False,False
3,False,False
4,False,False
...,...,...
8988,False,False
8989,False,False
8990,False,False
8991,False,False


In [37]:
mark_data.isnull()

Unnamed: 0,Sex,MaritalStatus,Age,Education,Occupation,YearsInSf,DualIncome,HouseholdMembers,Under18,HouseholdStatus,TypeOfHome,EthnicClass,Language,Income
0,False,False,False,False,False,False,False,False,False,False,False,False,True,False
1,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8988,False,False,False,False,False,False,False,False,False,False,False,False,False,False
8989,False,False,False,False,False,False,False,False,False,False,False,False,False,False
8990,False,False,False,False,False,False,False,False,False,False,False,False,False,False
8991,False,False,False,False,False,False,False,False,False,False,False,False,False,False


In [38]:
mark_data[ ['YearsInSf', 'Language']].isnull().mean()

YearsInSf    0.101523
Language     0.039920
dtype: float64
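
Taking the mean of a Boolean mask works because `True` counts as 1 and `False` as 0, so the result is the fraction of missing values in each column. A minimal sketch on hypothetical toy data (the column names match the marketing file, the values do not):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing entries
toy = pd.DataFrame(
    {
        "YearsInSf": [5.0, np.nan, 3.0, np.nan],
        "Language": [np.nan, 1.0, 1.0, 1.0],
        "Sex": [2, 1, 2, 2],
    }
)

# sum() of the mask counts missing values per column
missing_count = toy.isnull().sum()

# mean() of the mask gives the share of missing values per column
missing_share = toy.isnull().mean()

# Sort so the columns with the most missing data come first
print(missing_share.sort_values(ascending=False))
```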

In [40]:
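# Draw 15 random rows (not displayed, since it is not the last expression in the cell)
mark_data.sample(15)
# Shape before dropping: (rows, columns)
print(mark_data.shape)
# Drop every column that contains at least one missing value
dropped_df = mark_data.dropna(axis=1)
print(dropped_df.shape)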

(8993, 14)
(8993, 5)


### Activity:

* How many total respondents are in the dataset? 


In [42]:
mark_data.shape[0]

8993


* How many missing values does each attribute (column) in the dataset have? 


In [48]:
md_counts = mark_data.isnull().sum()



* Which attribute has the most missing values in the dataset? (**Hint**: To get the index of the maximum element you can use [`idxmax()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.idxmax.html) function)



In [50]:
md_counts.idxmax()

'YearsInSf'


* How do you fill the missing values with a `0`? 


In [52]:
mark_data.fillna(0)

Unnamed: 0,Sex,MaritalStatus,Age,Education,Occupation,YearsInSf,DualIncome,HouseholdMembers,Under18,HouseholdStatus,TypeOfHome,EthnicClass,Language,Income
0,2,1.0,5,4.0,5.0,5.0,3,3.0,0,1.0,1.0,7.0,0.0,9
1,1,1.0,5,5.0,5.0,5.0,3,5.0,2,1.0,1.0,7.0,1.0,9
2,2,1.0,3,5.0,1.0,5.0,2,3.0,1,2.0,3.0,7.0,1.0,9
3,2,5.0,1,2.0,6.0,5.0,1,4.0,2,3.0,1.0,7.0,1.0,1
4,2,5.0,1,2.0,6.0,3.0,1,4.0,2,3.0,1.0,7.0,1.0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8988,2,5.0,1,1.0,2.0,5.0,1,3.0,2,3.0,1.0,7.0,1.0,1
8989,1,5.0,2,4.0,1.0,5.0,1,4.0,0,3.0,1.0,7.0,1.0,2
8990,2,5.0,1,2.0,1.0,5.0,1,3.0,2,3.0,1.0,7.0,1.0,1
8991,1,1.0,6,4.0,3.0,5.0,2,3.0,1,2.0,3.0,7.0,1.0,4



* **Most Common Use**: Can you fill each missing value with the corresponding average for that attribute? 
    * For example, if the 'Education' attribute is missing for a person, can you find the average 'Education' of all people and fill the missing 'Education' value with that average? See if you can figure out how you would do that.

In [58]:
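# Mean of every column (missing values are skipped by default)
mark_data.mean()

# Mean of the 'Language' column only
avg_lang = mark_data['Language'].mean()
avg_lang

# Returns a new Series with the missing 'Language' values filled; mark_data is unchanged
mark_data['Language'].fillna(avg_lang)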

0       1.127519
1       1.000000
2       1.000000
3       1.000000
4       1.000000
          ...   
8988    1.000000
8989    1.000000
8990    1.000000
8991    1.000000
8992    1.000000
Name: Language, Length: 8993, dtype: float64
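
The cell above fills a single column. One possible way to extend it to every column at once is to pass the `Series` of column means to `fillna()`, which matches each mean to its column by name. This is a sketch on hypothetical toy data, not necessarily the solution intended by the activity.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names borrowed from the marketing file
toy = pd.DataFrame(
    {
        "Education": [4.0, np.nan, 2.0],
        "Language": [np.nan, 1.0, 3.0],
    }
)

# Series of per-column means (missing values are skipped)
col_means = toy.mean()

# Each column's missing values are filled with that column's own mean
filled = toy.fillna(col_means)

print(filled["Education"].tolist())  # [4.0, 3.0, 2.0]
print(filled["Language"].tolist())   # [2.0, 1.0, 3.0]
```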

## Activity

In [None]:
mark_data = pd.read_csv('./data/marketing.csv')

In [None]:
mark_data.head()

* How many elements of Occupation are missing? Default all the missing values for Occupation to `1`. Write a line of code that verifies there is no more missing data.

* Drop all the columns with missing data into a new data frame

* Drop all the rows with missing data into a new data frame