## 00-Pandas-Tutorial-02

Created by **John C.S. Lui** for CSCI3320 (Fundamentals of Machine Learning)<br>
**Date:** Jan 23, 2021.

In this lesson, we will learn:
1. How to add data to and remove data from our dataframe
2. How to sort our data

Let's start with a simple example first.

In [None]:
people = {
    'first': ['Corey', 'Jane', 'John'], 
    'last': ['Schafer', 'Doe', 'Doe'], 
    'email': ['CoreyMSchafer@gmail.com', 'JaneDoe@email.com', 'JohnDoe@email.com']
}
import pandas as pd
df = pd.DataFrame(people)
df

In [None]:
# Let's embed some spaces in the output
df['first'] + '  ' + df['last']

In [None]:
# Let's add an attribute, 'full_name', to the dataframe

df['full_name'] = df['first'] + ' ' + df['last']
df

In [None]:
# Let's remove the features 'first' and 'last' since they are redundant

df.drop(columns=['first', 'last'], inplace=True)
df

In [None]:
# Let's grab the full_name string, split it on ' ' and expand the pieces into columns

df['full_name'].str.split(' ', expand=True)

In [None]:
# Let's put it back into the dataframe; by doing so, we add two additional features
df[['first name', 'last name']] = df['full_name'].str.split(' ', expand=True)
df
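
One caveat with splitting on every space: a name with more than two words yields more than two pieces, which no longer fits a two-column assignment like the one above. A minimal sketch, assuming a hypothetical `names` series and using the `n=1` argument of `str.split` so that only the first space is used as a separator:

```python
import pandas as pd

# Hypothetical names; the second one contains two spaces
names = pd.Series(['Corey Schafer', 'Mary Ann Doe'])

# n=1 splits only at the first space, so we always get exactly two columns
parts = names.str.split(' ', n=1, expand=True)
parts.columns = ['first name', 'last name']
print(parts)
```

Whether the extra word belongs to the first or the last name depends on the data, so treat this as one possible convention rather than a rule.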

In [None]:
# Let's add an entry to the df
# (DataFrame.append was removed in pandas 2.0, so we use pd.concat instead)
pd.concat([df, pd.DataFrame([{'first name': 'Tony'}])], ignore_index=True)

In [None]:
# We can also define a dictionary and add it to the dataframe
people = {
    'first name': ['Tony', 'Steve'], 
    'last name': ['Stark', 'Rogers'], 
    'email': ['IronMan@avenger.com', 'Cap@avenger.com']
}
df2 = pd.DataFrame(people) # load it to df2

In [None]:
df2

In [None]:
# append df2 to df (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
pd.concat([df, df2], ignore_index=True, sort=False)  # as we can see, missing values show up as NaN

In [None]:
# Let's assign the result back to df (pd.concat replaces the removed DataFrame.append)
df = pd.concat([df, df2], ignore_index=True, sort=False)
df

In [None]:
# Let's delete the last data point, i.e., the row with index 4

df.drop(index=4)

In [None]:
# drop() returned a new dataframe, so df itself was not updated
df

In [None]:
filt = df['last name'] == 'Doe'
df.drop(index=df[filt].index, inplace=True)  # turn on the inplace argument

In [None]:
df  # now df doesn't have rows 1 and 2
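
The same rows can also be removed by keeping everything the filter does not match. A minimal sketch on a small stand-in dataframe (the data here is hypothetical, not the `df` built above), comparing `drop` by index labels with an inverted boolean mask:

```python
import pandas as pd

# Small stand-in dataframe for illustration
df = pd.DataFrame({
    'first name': ['Corey', 'Jane', 'John', 'Tony'],
    'last name': ['Schafer', 'Doe', 'Doe', 'Stark'],
})

filt = df['last name'] == 'Doe'

dropped = df.drop(index=df[filt].index)  # remove rows by their index labels
kept = df[~filt]                         # keep rows where the filter is False

print(dropped.equals(kept))
```

Neither call modifies `df` here; both return a new dataframe, so the result must be assigned if we want to keep it.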

## Sorting data within the dataframe

Let's explore this capability.  Again, we can start with a simple dataframe.

In [None]:
people = {
    'first': ['Corey', 'Jane', 'John', 'Adam'], 
    'last': ['Schafer', 'Doe', 'Doe', 'Doe'], 
    'email': ['CoreyMSchafer@gmail.com', 'JaneDoe@email.com', 'JohnDoe@email.com', 'A@email.com']
}

import pandas as pd

df = pd.DataFrame(people)

In [None]:
df    # display it

In [None]:
df.sort_values(by='last', ascending=True)   # sort on the attribute 'last', in ascending order

In [None]:
df.sort_values(by=['last', 'first'], ascending=True)    # sort on attribute 'last' and then 'first', in ascending order

In [None]:
df

In [None]:
# What about ascending in one feature and descending in another feature?

df.sort_values(by=['last', 'first'], ascending=[True, False], inplace=True)  # also update in place
df

In [None]:
df.sort_index()  # sort based on index, but we have not updated df

In [None]:
df['last'].sort_values()  # extract the values under feature 'last', then sort them
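
Sorting moves the rows but leaves their index labels attached, which is why `sort_index()` can undo a sort. A minimal sketch, assuming the same small dataframe as above without the email column, that also shows `reset_index(drop=True)` as one way to renumber the rows after sorting:

```python
import pandas as pd

df = pd.DataFrame({
    'first': ['Corey', 'Jane', 'John', 'Adam'],
    'last': ['Schafer', 'Doe', 'Doe', 'Doe'],
})

# Sort by 'last' ascending, then by 'first' descending
sorted_df = df.sort_values(by=['last', 'first'], ascending=[True, False])
print(list(sorted_df.index))   # the original labels travel with the rows

# sort_index() brings the rows back to their original order
restored = sorted_df.sort_index()
print(restored.equals(df))

# reset_index(drop=True) keeps the sorted order but renumbers the rows from 0
renumbered = sorted_df.reset_index(drop=True)
print(renumbered)
```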

## Let's try a realistic dataset

In [None]:
import pandas as pd

df = pd.read_csv('data/survey_results_public.csv', index_col='Respondent')
schema_df = pd.read_csv('data/survey_results_schema.csv', index_col='Column')

pd.set_option('display.max_columns', 85)
pd.set_option('display.max_rows', 85)

df.head()

In [None]:
# Let's sort by 'Country' in ascending order and, within each country, by 'ConvertedComp' in descending order, and update in place
df.sort_values(by=['Country', 'ConvertedComp'], ascending=[True, False], inplace=True)

In [None]:
# get the first 50 entries
df[['Country', 'ConvertedComp']].head(50)

In [None]:
# Find the n-largest entries in feature 'ConvertedComp'

df['ConvertedComp'].nlargest(10)

In [None]:
# get the 10 rows with the smallest values under 'ConvertedComp' (all columns are returned)
df.nsmallest(10, 'ConvertedComp')
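
The two calls above return different shapes: calling `nlargest` on a single column gives back a Series of values, while calling `nsmallest` on the dataframe gives back whole rows. A minimal sketch on a hypothetical toy dataframe (the column names mirror the survey data, but the values are made up for illustration):

```python
import pandas as pd

# Hypothetical stand-in for the survey data
toy = pd.DataFrame({
    'Country': ['A', 'B', 'C', 'D'],
    'ConvertedComp': [50000.0, 95000.0, 120000.0, 8000.0],
})

top = toy['ConvertedComp'].nlargest(2)      # Series: only the values, indexed by row label
bottom = toy.nsmallest(2, 'ConvertedComp')  # DataFrame: the full rows

print(top)
print(bottom)
```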