## 00-Pandas-Tutorial-03

Created by **John C.S. Lui** for CSCI3320 (Fundamentals of Machine Learning)<br>
**Date:** Jan 23, 2021.

In this lesson, we will learn:
1. Grouping the data
2. Aggregating the data
3. Exploring the data
4. Casting Datatypes and Handling Missing Values

Let's dive into the real dataset.

In [None]:
import pandas as pd

df = pd.read_csv('data/survey_results_public.csv', index_col='Respondent')
schema_df = pd.read_csv('data/survey_results_schema.csv', index_col='Column')

pd.set_option('display.max_columns', 85)
pd.set_option('display.max_rows', 85)

df.head()  # display the first 5 rows

In [None]:
# Select the feature `ConvertedComp` and display only its first 15 values

df['ConvertedComp'].head(15)

In [None]:
# Let's find the median of the feature `ConvertedComp`

df['ConvertedComp'].median()

In [None]:
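# Let's find the medians of all numeric features
# (numeric_only=True skips the text columns; recent pandas versions raise a TypeError without it)
df.median(numeric_only=True)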

In [None]:
# Let's get summary statistics (count, mean, std, min, quartiles, max) of the numeric features

df.describe()

In [None]:
# We can also count the number of non-missing entries for a particular feature
df['ConvertedComp'].count()

In [None]:
df['Hobbyist']  # display value for 'Hobbyist'

In [None]:
# If we want to find the distribution of Yes and No

df['Hobbyist'].value_counts()

In [None]:
df['SocialMedia']

In [None]:
schema_df.loc['SocialMedia'] # display the schema for the feature

In [None]:
df['SocialMedia'].value_counts()  # note that it will skip over 'NaN'

In [None]:
df['SocialMedia'].value_counts(normalize=True)   # normalize the counts into fractions that sum to 1
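
To see how `value_counts()` treats missing values, here is a minimal sketch on a made-up column. The Series `social` and its values are hypothetical stand-ins for `df['SocialMedia']`, not survey data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column standing in for df['SocialMedia']
social = pd.Series(['Reddit', 'YouTube', 'Reddit', np.nan, 'Reddit'])

counts = social.value_counts()                    # NaN is skipped by default
shares = social.value_counts(normalize=True)      # fractions of the non-missing entries
with_missing = social.value_counts(dropna=False)  # also count the NaN entries

print(counts)
print(shares)
print(with_missing)
```

Note that the normalized fractions are computed over the non-missing entries only, unless `dropna=False` is passed.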

In [None]:
df['Country'].value_counts()

In [None]:
country_grp = df.groupby('Country')  # group the data by feature `Country` (a plain string key lets get_group('India') work across pandas versions)

In [None]:
country_grp.get_group('India')   # Extract only the rows that belong to the group 'India'

From the result above, we know there are 9,061 data points with `Country == 'India'`.
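
As a rough illustration of what `get_group` returns, the sketch below compares it with a boolean filter on a tiny, made-up DataFrame. The frame `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, not taken from the survey
toy = pd.DataFrame({
    'Country': ['India', 'China', 'India', 'Germany'],
    'SocialMedia': ['WhatsApp', 'WeChat', 'YouTube', 'Reddit'],
})

toy_grp = toy.groupby('Country')

# Rows of the 'India' group, selected in two different ways
india_from_group = toy_grp.get_group('India')
india_from_mask = toy.loc[toy['Country'] == 'India']

# Both selections pick the same rows of the original frame
print(list(india_from_group.index))
print(list(india_from_mask.index))
```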

In [None]:
# Set up a filter, e.g., Country == India, then do a value count on the outcomes of SocialMedia.
# In other words, we only count SocialMedia values for respondents from India.

filt = df['Country'] == 'India'   # named `filt` to avoid shadowing Python's built-in filter()
df.loc[filt, 'SocialMedia'].value_counts()

In [None]:
# We can find the SocialMedia value counts (as fractions) for China like this

country_grp['SocialMedia'].value_counts(normalize=True).loc['China']

In [None]:
# Find the median income of Germany
country_grp['ConvertedComp'].median().loc['Germany']

In [None]:
# Find the median and mean income of Canada
country_grp['ConvertedComp'].agg(['median', 'mean']).loc['Canada']
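
The same `agg` pattern can be checked by hand on a small example. This is only a sketch: the frame `toy` and its salary figures are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data, not taken from the survey
toy = pd.DataFrame({
    'Country': ['Canada', 'Canada', 'Canada', 'Germany', 'Germany'],
    'ConvertedComp': [50000.0, 70000.0, 150000.0, 60000.0, 80000.0],
})

# One row per country, one column per aggregation
stats = toy.groupby('Country')['ConvertedComp'].agg(['median', 'mean'])
print(stats)

# The single large salary pulls Canada's mean above its median
print(stats.loc['Canada'])
```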

In [None]:
# Set up a filter for India
filt = df['Country'] == 'India'   # named `filt` to avoid shadowing Python's built-in filter()

# Count the respondents who satisfy the filter (those from India) and list Python in LanguageWorkedWith
df.loc[filt, 'LanguageWorkedWith'].str.contains('Python').sum()

In [None]:
country_grp['LanguageWorkedWith'].str.contains('Python').sum()  # raises AttributeError: a SeriesGroupBy object has no .str accessor

In [None]:
# Number of respondents in each country (sorted by country name) who know Python

country_grp['LanguageWorkedWith'].apply(lambda x: x.str.contains('Python').sum())
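
The difference between the failing call and the `apply` version can be reproduced on a few made-up rows. In this sketch the frame `toy` is hypothetical, and `na=False` is an added choice so that missing answers count as "does not know Python".

```python
import pandas as pd

# Hypothetical toy data, not taken from the survey
toy = pd.DataFrame({
    'Country': ['India', 'India', 'China', 'China'],
    'LanguageWorkedWith': ['Python;SQL', 'Java', 'Python;C++', None],
})
lang_grp = toy.groupby('Country')['LanguageWorkedWith']

# A SeriesGroupBy has no .str accessor, so this raises AttributeError
try:
    lang_grp.str.contains('Python')
except AttributeError as err:
    print('AttributeError:', err)

# apply() hands each group's Series to the lambda, where .str is available
knows_python = lang_grp.apply(lambda s: s.str.contains('Python', na=False).sum())
print(knows_python)
```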

In [None]:
# Display # of respondents by countries
country_respondents = df['Country'].value_counts()
country_respondents

In [None]:
# Store the number of respondents per country who know Python
country_uses_python = country_grp['LanguageWorkedWith'].apply(lambda x: x.str.contains('Python').sum())
country_uses_python

In [None]:
# Concatenate the two Series side by side into one dataframe (aligned on the country index)
python_df = pd.concat([country_respondents, country_uses_python], axis='columns', sort=False)
python_df

In [None]:
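# Rename the two features. In pandas >= 2.0, value_counts() names its result 'count'
# instead of 'Country', so both names are mapped; keys that are absent are ignored.
python_df.rename(columns={'Country': 'NumRespondents', 'count': 'NumRespondents', 'LanguageWorkedWith': 'NumKnowsPython'}, inplace=True)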

In [None]:
python_df

In [None]:
# Add a feature, 'PctKnowsPython', into the dataframe python_df

python_df['PctKnowsPython'] = (python_df['NumKnowsPython']/python_df['NumRespondents']) * 100
python_df

In [None]:
# Sort by percentage, highest first
python_df.sort_values(by='PctKnowsPython', ascending=False, inplace=True)
python_df.head(50)  # display the first 50 rows

## Observation

From the above, we can see that a country showing 100% of respondents knowing Python does not necessarily mean that
many people in that country know Python. The percentage is high only because the number of respondents is so small,
as in the first four data points.
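
One possible way to guard against this small-sample effect is to keep only countries with enough respondents before ranking them. The sketch below uses invented numbers, and the cutoff `min_respondents = 100` is an arbitrary assumption, not a value from the tutorial.

```python
import pandas as pd

# Hypothetical toy version of python_df; names and numbers are made up
toy_python_df = pd.DataFrame(
    {'NumRespondents': [2, 1, 9000, 3000], 'NumKnowsPython': [2, 1, 3100, 1500]},
    index=['CountryA', 'CountryB', 'CountryC', 'CountryD'],
)
toy_python_df['PctKnowsPython'] = (
    toy_python_df['NumKnowsPython'] / toy_python_df['NumRespondents'] * 100
)

min_respondents = 100  # arbitrary cutoff chosen for illustration

# Drop countries with too few respondents, then rank the rest
reliable = toy_python_df[toy_python_df['NumRespondents'] >= min_respondents]
reliable = reliable.sort_values(by='PctKnowsPython', ascending=False)
print(reliable)
```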

In [None]:
# Only extract information for people from Japan
python_df.loc['Japan']

## Casting Datatypes and Handling Missing Values

Let's start with a simple example

In [None]:
import pandas as pd
import numpy as np

# Define data. Missing entries appear as None, np.nan, and the placeholder strings 'NA' and 'Missing'
people = {
    'first': ['Corey', 'Jane', 'John', 'Chris', np.nan, None, 'NA'], 
    'last': ['Schafer', 'Doe', 'Doe', 'Schafer', np.nan, np.nan, 'Missing'], 
    'email': ['CoreyMSchafer@gmail.com', 'JaneDoe@email.com', 'JohnDoe@email.com', None, np.nan, 'Anonymous@email.com', 'NA'],
    'age': ['33', '55', '63', '36', None, None, 'Missing']
}

df = pd.DataFrame(people)

df.replace('NA', np.nan, inplace=True)         # replace with 'NaN'
df.replace('Missing', np.nan, inplace=True)    # replace with 'NaN'

In [None]:
df

In [None]:
# Drop every row that contains at least one NaN
df.dropna()

In [None]:
# Drop a row only if BOTH `last` and `email` are missing (how='all'); use how='any' to drop it if either is missing
df.dropna(axis='index', how='all', subset=['last', 'email'])  
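
The effect of `how='all'` versus `how='any'` is easiest to see side by side. A minimal sketch on a hypothetical three-row frame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 0 is complete, row 1 lacks `last`, row 2 lacks both
toy = pd.DataFrame({
    'last': ['Schafer', np.nan, np.nan],
    'email': ['a@example.com', 'b@example.com', np.nan],
})

# how='all': drop a row only when every column in `subset` is missing
drop_if_both_missing = toy.dropna(axis='index', how='all', subset=['last', 'email'])

# how='any': drop a row when at least one column in `subset` is missing
drop_if_either_missing = toy.dropna(axis='index', how='any', subset=['last', 'email'])

print(len(drop_if_both_missing))
print(len(drop_if_either_missing))
```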

In [None]:
# check whether each value is NA
df.isna()

In [None]:
# fill in NA with 0
df.fillna(0)

In [None]:
df.dtypes # find out the data type of all features

In [None]:
# Wrong way to get the mean of 'age'
df['age'].mean()    # raises an error because the ages are still stored as strings

In [None]:
# Cast 'age' to float (NaN is a float, so a column with missing values cannot be cast to int)
df['age'] = df['age'].astype(float)

In [None]:
df.dtypes

In [None]:
df['age'].mean()
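
As an alternative to replacing placeholder strings first and then calling `astype(float)`, `pd.to_numeric` with `errors='coerce'` turns anything it cannot parse into NaN in one step. This is a different technique from the one used above, sketched on a hypothetical Series `ages`.

```python
import pandas as pd

# Hypothetical toy column with a None and a placeholder string
ages = pd.Series(['33', '55', None, 'Missing'])

# Unparseable entries such as 'Missing' become NaN instead of raising an error
ages_numeric = pd.to_numeric(ages, errors='coerce')

print(ages_numeric.dtype)
print(ages_numeric.mean())  # the mean skips NaN by default
```

The trade-off is that `errors='coerce'` silently hides any unexpected string, so it is worth checking `unique()` first.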

## Let's try some real data

In [None]:
import pandas as pd

In [None]:
na_vals = ['NA', 'Missing']  # define a list of NA values
df = pd.read_csv('data/survey_results_public.csv', index_col='Respondent', na_values=na_vals)
schema_df = pd.read_csv('data/survey_results_schema.csv', index_col='Column')

In [None]:
pd.set_option('display.max_columns', 85)
pd.set_option('display.max_rows', 85)

In [None]:
df.head()

In [None]:
df['YearsCode'].head(10)  # display the first 10 data points for feature `YearsCode`; we will find some 'NaN'

In [None]:
df['YearsCode'].unique()  # show the unique values in `YearsCode`

In [None]:
df['YearsCode'] = df['YearsCode'].replace('Less than 1 year', 0)  # assign back; an in-place replace on a selected column may not update df in recent pandas

In [None]:
df['YearsCode'] = df['YearsCode'].replace('More than 50 years', 51)  # assign back; an in-place replace on a selected column may not update df in recent pandas

In [None]:
df['YearsCode'] = df['YearsCode'].astype(float)  # change values to float data type

In [None]:
df['YearsCode'].mean()

In [None]:
df['YearsCode'].median()
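
The whole `YearsCode` cleanup can be tried on a short, made-up column. In this sketch the Series `years` is hypothetical, and the two text answers are mapped in a single `replace` call with a dictionary, which is one possible variation on the step-by-step cells above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column standing in for df['YearsCode']
years = pd.Series(['4', 'Less than 1 year', np.nan, 'More than 50 years', '10'])

# Map the two text answers to numeric strings, then cast everything to float
years = years.replace({'Less than 1 year': '0', 'More than 50 years': '51'}).astype(float)

print(years.mean())    # NaN entries are ignored
print(years.median())
```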