<font color="green">*To start working on this notebook, or any other notebook that we will use in the Moringa Data Science Course, we will need to save our own copy of it. We can do this by clicking File > Save a Copy in Drive. We will then be able to make edits to our own copy of this notebook.*</font>

# Pandas Basics

## 1.0 Importing Pandas Library

In [1]:
# Let's import the pandas library just like we did with Numpy
# 
import pandas as pd

## 1.1 Performing Basic Operations

In [3]:
# Example 1
# Pandas is built on top of the NumPy package.
# Its main data structure is called the DataFrame. DataFrames allow us to store and manipulate data
# in rows (observations) and columns (variables).


# In this section, we will cover some of the basic operations 
# that we perform while using pandas.
# The first thing we will do is create a DataFrame from a list of dictionaries, as shown below.
# Think of a DataFrame as a table. By definition, a DataFrame is a 2-dimensional labeled data
# structure with columns of potentially different types.
# 
data = [{'name': 'vikash', 'age': 27}, {'name': 'Satyam', 'age': 14}]
df = pd.DataFrame.from_dict(data, orient='columns')
df

# Something to note in the results is that every time you create a DataFrame, pandas automatically assigns an index to each row.

Unnamed: 0,name,age
0,vikash,27
1,Satyam,14
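
The automatic row index mentioned above can be inspected directly. The short sketch below shows the default index and, as an optional variation, how we might supply our own row labels. The names `records`, `df_default`, `df_labeled` and the labels `row_a` / `row_b` are made up for illustration and are not part of the original notebook.

```python
import pandas as pd

records = [{'name': 'vikash', 'age': 27}, {'name': 'Satyam', 'age': 14}]

# Default behaviour: pandas assigns the row labels 0, 1, ...
df_default = pd.DataFrame(records)
print(df_default.index)

# Optional: supply our own row labels (hypothetical labels, for illustration only)
df_labeled = pd.DataFrame(records, index=['row_a', 'row_b'])
print(df_labeled.loc['row_b', 'age'])
```

With custom labels, `.loc` looks rows up by label instead of by the default integer position.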


In [4]:
# Example 3
# Creating a Dataframe with randomly generated data
# 

# We will import and use numpy in this example
import numpy as np

np_mat = np.random.randint(0,5,size=(5, 4))

print(np_mat)

# Uncomment the following lines after running the previous lines 
df = pd.DataFrame(np_mat, columns=list('ABCD'))
df

[[1 3 1 2]
 [1 0 1 3]
 [3 4 4 2]
 [3 4 3 4]
 [0 0 4 1]]


Unnamed: 0,A,B,C,D
0,1,3,1,2
1,1,0,1,3
2,3,4,4,2
3,3,4,3,4
4,0,0,4,1


In [5]:
# Example 2
# We can also create a Dataframe by inserting rows iteratively
# 

# For this example, we will use the randint() function 
# thus we will need to import it
from random import randint

# We will also need to declare the columns that we will need 
columns = ['a', 'b', 'c']

# Then creating our dataframe
df = pd.DataFrame(columns=columns)

# Lastly, we append random values to the dataframe iteratively using two nested for loops:
# the outer loop runs over the rows and the inner loop runs over the columns.
# In the outer loop, range(7) yields the numbers 0 to 6, so we end up with 7 rows.
# Inside the inner loop, we populate the dataframe with random integers between -1 and 1 (inclusive).
# Every time we generate a random number, we use the dataframe indexer .loc[] to insert it
# into one of the three columns.
# .loc[] selects elements by row and column label, much like indexing a Python list by position.
# Because it can access elements, it can also assign to them. Our dataframe starts out empty,
# so assigning to a row label that does not exist yet creates that row.
# The logic of this code can be a little confusing at first, so spend some time with your pair
# trying to understand how it works, as it will help you a lot in the future.
for i in range(7):  # 7 rows
    for c in columns:
        df.loc[i, c] = randint(-1, 1)
  
# and printing out the dataframe
df


Unnamed: 0,a,b,c
0,-1,0,0
1,-1,1,-1
2,-1,-1,0
3,-1,-1,1
4,-1,0,1
5,-1,1,1
6,-1,1,1
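
Filling a dataframe one cell at a time is good practice for understanding `.loc[]`, but it gets slow for large tables and tends to leave the columns with the generic `object` dtype. A common alternative, sketched below, is to collect the rows in a plain Python list first and build the dataframe once. The names `rows` and `df_fast` are introduced here for illustration.

```python
from random import randint

import pandas as pd

columns = ['a', 'b', 'c']

# Build 7 rows of random integers between -1 and 1 (inclusive) as a list of lists
rows = [[randint(-1, 1) for _ in columns] for _ in range(7)]

# Create the dataframe in a single step
df_fast = pd.DataFrame(rows, columns=columns)
print(df_fast.shape)
print(df_fast.dtypes)
```

Because the data is random, the values will differ on every run, but the shape is always 7 rows by 3 columns.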


In [6]:
# Example 4
# Creating a Dataframe from a csv file 
# 
#df = pd.read_csv('sample_data/california_housing_test.csv', delimiter = ',')
#df

# Uncomment the following lines after running the previous lines
df_url = "https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv"
df = pd.read_csv(df_url)
df

URLError: <urlopen error [Errno 11001] getaddrinfo failed>

In [7]:
# Example 5
# Changing Dataframe column names 
# 
df_list = [['AA', 1, 'a'],['BB', 2, 'a'],['CC', 3, 'a']]

df = pd.DataFrame(df_list, columns = ['name','value','salue'])
df

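# Uncomment the following lines after running this cell once
# Build the new labels and assign them to df.columns
# (editing df.columns.values in place is not guaranteed to take effect)
df.columns = [df.columns[0]] + ['prefix_' + val for val in df.columns[1:]]
df.columns.values
df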

Unnamed: 0,name,prefix_value,prefix_salue
0,AA,1,a
1,BB,2,a
2,CC,3,a
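
As a complement to the example above, pandas also offers `rename()` for changing selected column names and `add_prefix()` for prefixing all of them. The sketch below uses a small stand-in dataframe called `df_demo` (a name chosen for illustration) so that it can be run on its own.

```python
import pandas as pd

df_demo = pd.DataFrame([['AA', 1, 'a'], ['BB', 2, 'a']],
                       columns=['name', 'value', 'salue'])

# rename() takes a mapping of old name -> new name and returns a new dataframe
renamed = df_demo.rename(columns={'value': 'prefix_value', 'salue': 'prefix_salue'})
print(list(renamed.columns))

# add_prefix() puts the same prefix in front of every column label
prefixed = df_demo.add_prefix('prefix_')
print(list(prefixed.columns))
```

Neither call modifies `df_demo` itself; both return a new dataframe, which we can assign back if we want to keep the change.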


In [8]:
# Example 6
# A simpler way of changing Dataframe column names 
#
df_list = [['AA', "temp", 1],['BB', "temp", 2],['CC', "temp", 3]]
df = pd.DataFrame(df_list, columns = ['name','temp', 'value'])
df

# Uncomment the following line
df.columns = ['names', 'temperature', 'values']
df

Unnamed: 0,names,temperature,values
0,AA,temp,1
1,BB,temp,2
2,CC,temp,3


In [19]:
# Example 7
# Choosing specific columns from a DataFrame
# 
df_list = [['AA', "temp", 1],['BB', "temp", 2],['CC', "temp", 3]]
df = pd.DataFrame(df_list, columns = ['name','temp', 'value'])
df

# Uncomment the following lines after running 
df = df[["name","temp"]]
df

Unnamed: 0,name,temp
0,AA,temp
1,BB,temp
2,CC,temp


In [9]:
# Example 8
# Deleting/dropping columns or extracting columns from Dataframe 
# 
df_list = [['AA', "temp", 1],['BB', "temp", 2],['CC', "temp", 3]]
df = pd.DataFrame(df_list, columns = ['name','value','temp'])
df

# Uncomment the following lines
df.drop('value', axis=1, inplace=True)
df

# Uncomment the following lines after running the previous commented lines
df = pd.DataFrame([['AA', 1],['BB', 2],['CC', 3]], columns = ['name','value'])
df

# Uncomment the following lines after running the previous commented lines
values = df.pop('value')
df

# Uncomment the line below after running the previous commented lines
values

0    1
1    2
2    3
Name: value, dtype: int64

### <font color="green">1.1 Challenges</font>

In [20]:
# Challenge 1
# Create a Dataframe from the following dictionary 
# 
studentData = {
    'name' : ['jack', 'Riti', 'Aadi'],
    'age' : [34, 30, 16],
    'city' : ['Sydney', 'Delhi', 'New york']
}

df = pd.DataFrame.from_dict(studentData)
df

Unnamed: 0,name,age,city
0,jack,34,Sydney
1,Riti,30,Delhi
2,Aadi,16,New york


In [11]:
# Challenge 2
# Create the Dataframe shown below in the Expected Output by inserting rows iteratively
# 
from random import choice

columns = ['X', 'Y', 'Z']
df = pd.DataFrame(columns=columns)

value_range = [1, 2, 3]

for row in range(4):
    for col in columns:
        df.loc[row, col] = choice(value_range)
        
df

Unnamed: 0,X,Y,Z
0,3,1,1
1,2,2,3
2,3,2,3
3,2,2,2


In [12]:
# Challenge 2: Expected Output
# [Do not run this cell]
# Running this cell will clear the output

In [13]:
# Challenge 3
# Create the Dataframe shown in the Expected Output below with randomly generated integers
# 
from random import choice

columns = ['A', 'B', 'C', 'D']
df = pd.DataFrame(columns=columns)

value_range = [x for x in range(10)]

for row in range(10):
    for col in columns:
        df.loc[row, col] = choice(value_range)
        
df

Unnamed: 0,A,B,C,D
0,1,4,0,3
1,5,7,0,4
2,1,2,2,6
3,1,9,3,1
4,5,7,0,1
5,0,0,6,7
6,3,7,4,3
7,5,4,9,6
8,9,9,7,1
9,5,0,8,1


In [14]:
# Challenge 3: Expected Output
# Running this cell will clear the output
# Attention: Do not run this cell!

In [15]:
# Challenge 4
# Create a Dataframe from the mnist_test csv file in the sample_data directory
#


In [None]:
# Challenge 5
# Create a Dataframe from dataset with the following url source
# URL: http://bit.ly/NairobiBusesDataset
#
dataset_url = "http://bit.ly/NairobiBusesDataset"
df = pd.read_csv(dataset_url)
df

In [1]:
# Challenge 6
# Change the column names of the dataset from this source (http://bit.ly/FiveYearData) 
# to: country, year, population, continent, life_exp, gdp_per_cap
#
import pandas as pd

dataset_url = "https://raw.githubusercontent.com/resbaz/r-novice-gapminder-files/master/data/gapminder-FiveYearData.csv"
df =  pd.read_csv(dataset_url)
df.columns = ['country', 'year', 'population', 'continent', 'life_exp', 'gdp_per_cap']
df

Unnamed: 0,country,year,population,continent,life_exp,gdp_per_cap
0,Afghanistan,1952,8425333.0,Asia,28.801,779.445314
1,Afghanistan,1957,9240934.0,Asia,30.332,820.853030
2,Afghanistan,1962,10267083.0,Asia,31.997,853.100710
3,Afghanistan,1967,11537966.0,Asia,34.020,836.197138
4,Afghanistan,1972,13079460.0,Asia,36.088,739.981106
...,...,...,...,...,...,...
1699,Zimbabwe,1987,9216418.0,Africa,62.351,706.157306
1700,Zimbabwe,1992,10704340.0,Africa,60.377,693.420786
1701,Zimbabwe,1997,11404948.0,Africa,46.809,792.449960
1702,Zimbabwe,2002,11926563.0,Africa,39.989,672.038623


In [2]:
# Challenge 7
# Choose the country, year and continent columns from a DataFrame in challenge 6
#
df1 = df.loc[:, ['country', 'year', 'continent']]
df1

Unnamed: 0,country,year,continent
0,Afghanistan,1952,Asia
1,Afghanistan,1957,Asia
2,Afghanistan,1962,Asia
3,Afghanistan,1967,Asia
4,Afghanistan,1972,Asia
...,...,...,...
1699,Zimbabwe,1987,Africa
1700,Zimbabwe,1992,Africa
1701,Zimbabwe,1997,Africa
1702,Zimbabwe,2002,Africa


In [3]:
# Challenge 8
# Drop the population and life_exp columns from the DataFrame in challenge 6
#
df.drop(columns=['population', 'life_exp'], inplace=True)
df

Unnamed: 0,country,year,continent,gdp_per_cap
0,Afghanistan,1952,Asia,779.445314
1,Afghanistan,1957,Asia,820.853030
2,Afghanistan,1962,Asia,853.100710
3,Afghanistan,1967,Asia,836.197138
4,Afghanistan,1972,Asia,739.981106
...,...,...,...,...
1699,Zimbabwe,1987,Africa,706.157306
1700,Zimbabwe,1992,Africa,693.420786
1701,Zimbabwe,1997,Africa,792.449960
1702,Zimbabwe,2002,Africa,672.038623


## 1.2 Manipulating Dataframes

In [5]:
# Example 1
# While working with dataframes, we may sometimes want to iterate over a dataframe and perform some operation on each row. Pandas gives us two methods for doing this.
# We are going to look at both of them in the following example.
# 

df = pd.DataFrame([['AA', 1],['BB', 2],['CC', 3]], columns = ['name','value'])
df

# Uncomment after running previous lines
# Since we are iterating over rows, we are going to use a pandas method called iterrows(). This method returns an iterator
# that yields the index of each row together with the row's data. Each row's data is stored in a Series
for index, row in df.iterrows():
    print(index, row['name'], row['value'])


# Uncomment after running previous lines
# The other method that is available to us is itertuples(). This method loops through each row and returns it
# as a named tuple.
for row in df.itertuples():
    print(row)

0 AA 1
1 BB 2
2 CC 3
Pandas(Index=0, name='AA', value=1)
Pandas(Index=1, name='BB', value=2)
Pandas(Index=2, name='CC', value=3)
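
Since `itertuples()` returns named tuples, the fields of each row can be read as attributes. The sketch below shows this on the same small dataframe; `df_demo` and `labels` are names introduced here for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame([['AA', 1], ['BB', 2], ['CC', 3]], columns=['name', 'value'])

labels = []
# index=False leaves the row index out of each tuple
for row in df_demo.itertuples(index=False):
    # Fields are accessed as attributes named after the columns
    labels.append(f"{row.name}-{row.value}")

print(labels)
```

Attribute access only works for column names that are valid Python identifiers; for other names, `iterrows()` with `row['column name']` is the simpler choice.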


In [None]:
# Example 2
# Applying a function to Dataframe row
# This is useful when cleaning up data - converting formats, altering values etc.
#  In this example we are going to create a third column then create a function that concatenates the values of the first and second column.
df = pd.DataFrame([['AA', 1],['BB', 2],['CC', 3]], columns = ['name','value'])
df

# Uncomment after running previous lines

# Define a function that takes in two values and returns them concatenated together
def function_1(val_1, val_2):
    # Before returning, we convert the second value into a string, because the values in the
    # 'value' column are integers while the values in the 'name' column are strings
    return val_1 + str(val_2)

# Create a third column called col_a, then apply the function that we defined above using a Python lambda.
# Since lambda is a new concept, take a few minutes to read about it
# here: https://www.afternerd.com/blog/python-lambdas/ . It's not a difficult concept to grasp, so it should not take a lot
# of time to understand.
df['col_a'] = df.apply(lambda row: function_1(row['name'], row['value']), axis=1)
df

# Uncomment after running previous lines
# We create a new function that takes in a row value and multiplies it by 2
def function_2(row):
    return row['value'] * 2

# Create another column called col_b, that applies the above function and populates the column with our new values
df['col_b'] = df.apply(lambda row: function_2(row), axis=1)
df
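
For simple arithmetic or string operations like the two above, the same columns can usually be computed without `apply()` by operating on whole columns at once. This is a sketch of that alternative on a stand-in dataframe named `df_demo` (an illustrative name), not a replacement for the `apply()` technique this example teaches.

```python
import pandas as pd

df_demo = pd.DataFrame([['AA', 1], ['BB', 2], ['CC', 3]], columns=['name', 'value'])

# Concatenate the two columns element-wise; the integers are converted to strings first
df_demo['col_a'] = df_demo['name'] + df_demo['value'].astype(str)

# Multiply every element of the column by 2 in one step
df_demo['col_b'] = df_demo['value'] * 2

print(df_demo)
```

`apply()` remains the tool to reach for when the per-row logic is too complex to express as column operations.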

In [None]:
# Example 3
# Applying a function to a specific column of Dataframe
# 
df = pd.DataFrame([['AA', 1], ['BB', 2], ['CC', 3]], columns=['name', 'value'])
df

# Uncomment after running previous lines
def function_1(val_1):
    return "prefix_" + str(val_1)
  
# Uncomment after running previous lines
# To be able to apply a function to the elements of the name column, we use the map function. This function allows us
#to apply a specific function to all the elements of the targeted column.
df['name'] = df['name'].map(function_1)
df 


In [86]:
import pandas as pd
# Example 4
# Finding and replacing a value in Dataframe
# 
df = pd.DataFrame([['One', 'Two'], ['Four', 'Abcd'], ['One', 'Bcd'], ['Five', 'Cd']], columns=['A', 'B'])
df

# Uncomment after running previous lines
df.loc[df['A'] == 'One', 'A'] = 0
df

0    One
2    One
Name: A, dtype: object
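
Another way to find and replace a value in a column is the `replace()` method, which returns a new Series with the substitution applied. The sketch below replaces `'One'` with the string `'Zero'` (a stand-in value chosen here so that the column keeps a single type); `df_demo` is an illustrative name.

```python
import pandas as pd

df_demo = pd.DataFrame([['One', 'Two'], ['Four', 'Abcd'], ['One', 'Bcd'], ['Five', 'Cd']],
                       columns=['A', 'B'])

# The boolean mask tells us which rows currently hold the value we are looking for
mask = df_demo['A'] == 'One'
print(mask.sum())

# replace() swaps every exact match in the column
df_demo['A'] = df_demo['A'].replace('One', 'Zero')
print(df_demo)
```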

### <font color="green">1.2 Challenges</font>

In [3]:
# Challenge 1
# Create a Dataframe from the Dictionary below and iterate over the rows
graduates = {'name':["Jane Njoroge", "June Adhiambo", "Kevin Swale", "Heidi Sang"], 
        'degree': ["MBA", "BCA", "M.Tech", "MBA"], 
        'score':[90, 40, 80, 98]} 

df =  pd.DataFrame(graduates)

#iterating through
for index, row in df.iterrows():
    print(f"{index}, {row['name']}, {row['degree']}, {row['score']}")

0, Jane Njoroge, MBA, 90
1, June Adhiambo, BCA, 40
2, Kevin Swale, M.Tech, 80
3, Heidi Sang, MBA, 98


In [None]:
import pandas as pd
# Challenge 2
# Apply the given function to SAL-RATE column
# http://bit.ly/EmployeeSalary

df = pd.read_csv("https://raw.githubusercontent.com/DanaZL/Introduction_to_Data_Analysis/master/Analysis_employees_salaries/Civil_List_2014.csv")

# Get rid of $ and , in the SAL-RATE, then convert it to a float
#def money_to_float(money_str):
#    return float(money_str.replace("$","").replace(",",""))

df

In [None]:
# Challenge 3
# Replace the value in M.Tech in Challenge 1 Dataframe with Msc.Tech
# 
df = pd.DataFrame(graduates)
df.loc[df['degree'] == 'M.Tech', 'degree'] = 'Msc.Tech'
df

## 1.3 Splitting and Merging Dataframes


In [5]:
import pandas as pd
# Example 1
# Merging Dataframes by columns using join
# Create the first dataframe
df = pd.DataFrame([[1, 3], [2, 4]], columns=['A', 'B'])
print(df)

# Uncomment and run after running previous lines
# Create the second dataframe
df2 = pd.DataFrame([[1, 5], [1, 6]], columns=['A', 'C'])
print(df2)

# Uncomment and run after running previous lines
#df.merge(df2, how='left', on='A')  # merges on columns A

# Uncomment and run after running previous lines
df2.drop_duplicates(subset=['A'], inplace=True)
df2

# Uncomment and run after running previous lines
df.merge(df2, how='left', on='A')

   A  B
0  1  3
1  2  4
   A  C
0  1  5
1  1  6


Unnamed: 0,A,B,C
0,1,3,5.0
1,2,4,
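
To see why the example drops duplicates before merging, it may help to look at what happens when we skip that step. The sketch below merges the two original dataframes as they are and adds the `indicator=True` option of `merge()`, which records where each row came from. The names `left`, `right` and `merged` are used here for illustration.

```python
import pandas as pd

left = pd.DataFrame([[1, 3], [2, 4]], columns=['A', 'B'])
right = pd.DataFrame([[1, 5], [1, 6]], columns=['A', 'C'])

# A = 1 appears twice in `right`, so the matching row of `left` is repeated
merged = left.merge(right, how='left', on='A', indicator=True)
print(merged)
```

The row with `A = 1` appears twice (once per match), while the row with `A = 2` has no match and is marked `left_only` with a missing value in `C`.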


In [9]:
# Example 2
# Merging Dataframes by columns on index
# 
df = pd.DataFrame([[1, 3], [2, 4]], columns=['A', 'B'])
print(df)

# Uncomment and run after running previous lines
df2 = pd.DataFrame([[1, 5], [1, 6]], columns=['A', 'D'])
print(df2)

# Uncomment and run after running previous lines
pd.concat([df, df2], axis=1)

   A  B
0  1  3
1  2  4
   A  D
0  1  5
1  1  6


Unnamed: 0,A,B,A,D
0,1,3,1,5
1,2,4,1,6


In [21]:
import numpy as np
# Example 3
# Merging Dataframes and splitting again 
# 
ts1 = [1,2,3,4]
ts2 = [6,7,8,9]
d = {'col_1': ts1, 'col_2': ts2}
d

# Uncomment and run after running previous lines
df_1 = pd.DataFrame(d)
df_1

# Uncomment and run after running previous lines
df_2 = pd.DataFrame(np.random.randn(3, 2), columns=['col_1', 'col_2'])
df_2

# Uncomment and run after running previous lines
df_all = pd.concat((df_1, df_2), axis=0, ignore_index=True)
df_all

# Uncomment and run after running previous lines
print(df_1.shape)
print(df_2.shape)
print(df_all.shape)

# Uncomment and run after running previous lines
# Split the combined dataframe back into its two original parts by row position
df_train = df_all.iloc[:df_1.shape[0]]
df_test = df_all.iloc[df_1.shape[0]:]

# Uncomment and run after running previous lines
print(df_train.shape)
print(df_test.shape)
print(df_all.shape)

(4, 2)
(3, 2)
(7, 2)
(4, 2)
(3, 2)
(7, 2)


In [100]:
# Example 4
# Grouping by a Dataframe and iterating over grouped series
# 
classes = ["class 1"] * 5 + ["class 2"] * 5
sub_class = ['c1','c2','c2','c1','c3'] + ['c1','c2','c3','c2','c3']
vals = [1,3,5,1,3] + [2,6,7,5,2]
p_df = pd.DataFrame({"class": classes, "sub_class": sub_class, "vals": vals})
p_df

# Uncomment and run after running previous lines
grouped = p_df.groupby(['class', 'sub_class'])['vals'].median()
grouped

# Uncomment and run after running previous lines
for index_val, value in grouped.items():  # Series.iteritems() was removed in pandas 2.0
    class_name, sub_class_name = index_val
    print(class_name, ":", sub_class_name, ":", value)

class 1 : c1 : 1.0
class 1 : c2 : 4.0
class 1 : c3 : 3.0
class 2 : c1 : 2.0
class 2 : c2 : 5.5
class 2 : c3 : 4.5
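
When we need more than one statistic per group, `agg()` can compute several at once, and `reset_index()` turns the group labels back into ordinary columns. The sketch below rebuilds the same data so that it can be run on its own; `summary` is a name introduced here.

```python
import pandas as pd

p_df = pd.DataFrame({
    'class': ['class 1'] * 5 + ['class 2'] * 5,
    'sub_class': ['c1', 'c2', 'c2', 'c1', 'c3'] + ['c1', 'c2', 'c3', 'c2', 'c3'],
    'vals': [1, 3, 5, 1, 3] + [2, 6, 7, 5, 2],
})

# One row per (class, sub_class) pair, with three statistics of vals
summary = (
    p_df.groupby(['class', 'sub_class'])['vals']
        .agg(['median', 'sum', 'count'])
        .reset_index()
)
print(summary)
```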


### <font color="green">1.3 Challenges</font>

In [52]:
# Challenge 1
# Create the Dataframes from the given Dictionaries below and then merge them 
#
dt1 = {
        'subject_id': ['1', '2', '3', '4', '5'],
        'first_name': ['Fridah', 'Kwasi', 'Victor', 'Alice', 'Audrey'], 
        'last_name': ['Njeri', 'Adi Dako', 'Oliech', 'Tergat', 'Cheng']}

dt2 = {
        'subject_id': ['4', '5', '6', '7', '8'],
        'first_name': ['Michael', 'Judy', 'Lenny', 'Mohammed', 'Betty'], 
        'last_name': ['Macharia', 'Waithera', 'Baraza', 'Ali', 'Kyalo']}

dt1 = pd.DataFrame(dt1)
dt2 = pd.DataFrame(dt2)
dt = dt1.merge(dt2, how="outer", on="subject_id")
dt

Unnamed: 0,subject_id,first_name_x,last_name_x,first_name_y,last_name_y
0,1,Fridah,Njeri,,
1,2,Kwasi,Adi Dako,,
2,3,Victor,Oliech,,
3,4,Alice,Tergat,Michael,Macharia
4,5,Audrey,Cheng,Judy,Waithera
5,6,,,Lenny,Baraza
6,7,,,Mohammed,Ali
7,8,,,Betty,Kyalo


In [57]:
# Challenge 2
# Using dt1 and dt2 dictionaries from Challenge 1, create another Dataframe from
# the dt3 Dictionary below. Then merge all of them along the subject_id value
# i.e. Columns from left; subject_id, first_name, last_name, test_id
dt3 = {
        'subject_id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'],
        'test_id': [51, 15, 15, 61, 16, 14, 15, 1, 61, 16]}


dt3 = pd.DataFrame(dt3)

frames = {'first':dt1, 'second':dt2}
concat = pd.concat(frames)

Merged = concat.merge(dt3, how='left', on='subject_id')
Merged

Unnamed: 0,subject_id,first_name,last_name,test_id
0,1,Fridah,Njeri,51.0
1,2,Kwasi,Adi Dako,15.0
2,3,Victor,Oliech,15.0
3,4,Alice,Tergat,61.0
4,5,Audrey,Cheng,16.0
5,4,Michael,Macharia,61.0
6,5,Judy,Waithera,16.0
7,6,Lenny,Baraza,
8,7,Mohammed,Ali,14.0
9,8,Betty,Kyalo,15.0


In [75]:
# Challenge 3
# Create dataframes using dt1 and dt2 from Challenge 1 then merge based on indexes
# 
merged = dt1.merge(dt2, left_on=dt1.index, right_on=dt2.index, suffixes=('_first', '_second'))
merged

Unnamed: 0,key_0,subject_id_first,first_name_first,last_name_first,subject_id_second,first_name_second,last_name_second
0,0,1,Fridah,Njeri,4,Michael,Macharia
1,1,2,Kwasi,Adi Dako,5,Judy,Waithera
2,2,3,Victor,Oliech,6,Lenny,Baraza
3,3,4,Alice,Tergat,7,Mohammed,Ali
4,4,5,Audrey,Cheng,8,Betty,Kyalo


In [110]:
# Challenge 4
# Create a dataframe from the following url (http://bit.ly/MDSTelecomData), then
# determine the sum of the durations per month of calls, sms and data entries.
# Hint: Use groupby i.e. month, duration
#

data_url = "http://bit.ly/MDSTelecomData"
frame = pd.read_csv(data_url)
frameCopied = frame.copy()
resultFrame = frameCopied.groupby(['item'])['duration'].sum()
resultFrame

item
call    92321.00
data     5164.35
sms       292.00
Name: duration, dtype: float64

In [119]:
# Example 1
# 
# Converting categorical columns to integer columns using label encoding method
# so as to prepare to use categorical data in analysis. 
# For this example, we'll use a new Python library called sklearn (scikit-learn). Sklearn is a machine learning library for Python that is used in data mining and data analysis.
# As we progress in our data science journey we'll learn more about its capabilities. You will use it throughout modules 1 and 2 of Core as well, and in many future projects.
# From this library we are importing a class called LabelEncoder that will help us convert categorical data into integers
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()

# Scenario 1
df = pd.DataFrame({'col_1': [1, 0, 1, 2], 
                   'col_2': [1.2, 3.1, 4.4, 5.5], 
                   'col_3': [1, 2, 3, 4], 
                   'col_4': ['a', 'b', 'c', 'd']})
df
# Some of the models that you will learn about in the future only deal with numbers, particularly integers
# First, we will convert col_4 to be an integer column, so that the values within it are integers as well.
df.info()

# This will study the column, figure out the unique categories, and assign an integer value to each starting from 0
# This is internal to the label_encoder, it doesn't change the dataframe yet
label_encoder.fit(df['col_4'])

# View the labels in the column
list(label_encoder.classes_)

# This is how you transform the categories into integers
df['col_4'] = label_encoder.transform(df['col_4'])

df
df.info()

#Uncomment after running previous lines
# Slightly different scenario: repeated categories : ['a', 'b', 'b', 'a']. In this scenario, we will have duplicates in our categories
df2 = pd.DataFrame({'col_1': [1, 0, 1, 2], 
                    'col_2': [1.2, 3.1, 4.4, 5.5], 
                    'col_3': [1, 2, 3, 4], 
                    'col_4': ['a', 'b', 'b', 'a']})
df2

# # Fitting the label encoder method in the column
label_encoder.fit(df2['col_4'])

# # Converting the categories into integers
df2['col_4'] = label_encoder.transform(df2['col_4'])
# # view the dataframe with the converted column
df2
df2.info()
df2

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   col_1   4 non-null      int64  
 1   col_2   4 non-null      float64
 2   col_3   4 non-null      int64  
 3   col_4   4 non-null      object 
dtypes: float64(1), int64(2), object(1)
memory usage: 176.0+ bytes
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   col_1   4 non-null      int64  
 1   col_2   4 non-null      float64
 2   col_3   4 non-null      int64  
 3   col_4   4 non-null      int32  
dtypes: float64(1), int32(1), int64(2)
memory usage: 176.0 bytes
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   col_1   4 non-null      int64  
 1   col_2   

Unnamed: 0,col_1,col_2,col_3,col_4
0,1,1.2,1,0
1,0,3.1,2,1
2,1,4.4,3,1
3,2,5.5,4,0
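
Label encoding can also be sketched with pandas alone, which is handy for checking what the encoder is doing. This is an alternative to `LabelEncoder`, not the method used above: it converts the column to the pandas `category` dtype and reads off the integer codes. For string categories like these, the categories are sorted alphabetically, so the codes are expected to line up with the ones `LabelEncoder` produced. The names `col_4`, `as_category`, `codes` and `decoded` are illustrative.

```python
import pandas as pd

col_4 = pd.Series(['a', 'b', 'b', 'a'])

# Converting to the category dtype finds the unique values and sorts them
as_category = col_4.astype('category')
print(list(as_category.cat.categories))

# Each value is replaced by the position of its category, starting from 0
codes = as_category.cat.codes
print(list(codes))

# The mapping can be reversed by looking the codes up in the categories
decoded = as_category.cat.categories.take(codes.to_numpy())
print(list(decoded))
```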


In [124]:
# Example 2
# Reducing high dimensionality from categorical column
# We will learn more about the topic of Dimension reduction in Core
# More info: 
# 1) http://bit.ly/DimensionReductionExample
# 2) http://bit.ly/DimensionReductionProblem
#
df = pd.DataFrame({'groups': ['group 1','group 2','group 1','group 2','group 3','group 4','group 5','group 1','group 2','group 5'], 
                   'vals': [1,2,3,4,5,6,7,8,9,10]})
df

# Uncomment after running previous lines
df['groups'].value_counts()

# Uncomment after running previous lines
high_dim_columns = ['groups']

# Uncomment after running previous lines
for column in high_dim_columns:
    counts = df[column].value_counts()
    # Categories that occur at most twice are treated as rare
    rare_values = counts[counts <= 2].index
    df.loc[df[column].isin(rare_values), column] = 'other'
df

Unnamed: 0,groups,vals
0,group 1,1
1,group 2,2
2,group 1,3
3,group 2,4
4,other,5
5,other,6
6,other,7
7,group 1,8
8,group 2,9
9,other,10


In [127]:
# Example 3
# Converting categorical column to one hot encoded column 
# More information on hot encoding: http://bit.ly/HotEncoding
# 
df = pd.DataFrame({'sex': ['M', 'F', 'M', 'F'], 
                   'col_2': [1.2, 3.1, 4.4, 5.5], 
                   'col_3': [1, 2, 3, 4], 
                   'col_4': ['a', 'b', 'c', 'd']})
df

# Uncomment after running previous lines
categorical_variables = ['sex']

# Uncomment after running previous lines
for variable in categorical_variables:
    df[variable] = df[variable].fillna("Missing") # Fill missing data with the word "Missing"
    dummies = pd.get_dummies(df[variable], prefix=variable) # Create a dataframe of dummy columns
    df = pd.concat([df, dummies], axis=1) # Add the dummy columns to the dataframe
    df.drop([variable], axis=1, inplace=True) # Drop the original variable
df

Unnamed: 0,sex,col_2,col_3,col_4,sex_F,sex_M
0,M,1.2,1,a,0,1
1,F,3.1,2,b,1,0
2,M,4.4,3,c,0,1
3,F,5.5,4,d,1,0
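
The loop above can often be shortened, because `get_dummies()` also accepts a whole dataframe together with the list of columns to encode, and it drops the original columns for us. The sketch below assumes there are no missing values to fill; `df_demo` and `encoded` are illustrative names, and `dtype=int` is passed so that the dummy columns hold 0 and 1 rather than booleans.

```python
import pandas as pd

df_demo = pd.DataFrame({'sex': ['M', 'F', 'M', 'F'],
                        'col_2': [1.2, 3.1, 4.4, 5.5],
                        'col_3': [1, 2, 3, 4],
                        'col_4': ['a', 'b', 'c', 'd']})

# Encode only the 'sex' column; the other columns are left untouched
encoded = pd.get_dummies(df_demo, columns=['sex'], dtype=int)
print(encoded)
```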


### <font color="green">1.4 Challenges</font>

In [134]:
# Challenge 1
# Create a dataframe from the given dictionary below then
# Convert the categorical column to an integer column
# 
dt1 = {'patient': [1, 1, 1, 2, 2], 
        'obs': [1, 2, 3, 1, 2], 
        'diagnosis': [0, 1, 0, 1, 0],
        'score': ['weak', 'strong', 'normal', 'weak', 'normal']}

frame = pd.DataFrame(dt1)

from sklearn.preprocessing import LabelEncoder
encod = LabelEncoder()

encod.fit(frame['score'])
frame['score'] = encod.transform(frame['score'])
frame

Unnamed: 0,patient,obs,diagnosis,score
0,1,1,0,2
1,1,2,1,1
2,1,3,0,0
3,2,1,1,2
4,2,2,0,0


In [143]:
# Challenge 2
# Convert the categorical column to one hot encoded column
# in the following list
# 
list1 = [["Nairobi", "range", 3000], ["Mombasa", "tuktuk", 4000], ["Nakuru", "tuktuk", 1000]]

label_encoder = LabelEncoder()

frame = pd.DataFrame(list1, columns=['city', 'car_type', 'fare_price'])
label_encoder.fit(frame['car_type'])
frame['car_type'] = label_encoder.transform(frame['car_type'])
label_encoder.fit(frame['city'])
frame['city'] = label_encoder.transform(frame['city'])
frame

Unnamed: 0,city,car_type,fare_price
0,1,0,3000
1,0,1,4000
2,2,1,1000


## 1.5 Splitting Columns


In [146]:
# Example 1
# Splitting a column using a delimiter
# 
data = [{'test': 'vikash|Arpit', 'val': 6},
        {'test': 'vikash_1|arpit|Vinayp', 'val': 3},
        {'test': 'arpit|vinayp', 'val': 2}]
df = pd.DataFrame.from_dict(data, orient='columns')
df

# Uncomment after running previous lines
df['test'].apply(lambda x: pd.Series([i for i in reversed(x.lower().split('|'))]))

Unnamed: 0,0,1,2
0,arpit,vikash,
1,vinayp,arpit,vikash_1
2,vinayp,arpit,
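
Pandas also has a built-in string method for this kind of split. The sketch below uses `str.split()` with `expand=True`, which places each piece in its own column. Unlike the example above, it keeps the pieces in their original order instead of reversing them. `df_demo` and `parts` are names introduced for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({'test': ['vikash|Arpit', 'vikash_1|arpit|Vinayp', 'arpit|vinayp'],
                        'val': [6, 3, 2]})

# Lowercase the text, then split on the '|' delimiter into separate columns
parts = df_demo['test'].str.lower().str.split('|', expand=True)
print(parts)
```

Rows with fewer pieces than the longest row are padded with missing values on the right.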


In [145]:
# Example 2
# Splitting a column using delimiter and one hot encode the values 
# 
data = [{'test': 'vikash|Arpit', 'val': 6},
        {'test': 'vikash_1|arpit|Vinayp', 'val': 3},
        {'test': 'arpit|vinayp', 'val': 2}]

df = pd.DataFrame.from_dict(data, orient='columns')
df

# Uncomment after running previous lines
# chosen_columns = set()
# for idx, row in df.iterrows():
#     for val in str(row['test']).lower().split('|'):
#         chosen_columns.add(val.strip())

# Uncomment after running previous lines
# chosen_columns_list = list(chosen_columns)

# Uncomment after running previous lines
# chosen_columns_list.sort(key=len, reverse=True) 
# chosen_columns_list


# def get_one_hot_encoded_column(col_value):
#     col_value = col_value.lower()
#     new_col_value = ''
#     for val in chosen_columns_list:
#         if val in col_value.split('|'):
#             col_value = col_value.replace(val, '')
#             new_col_value += '1,'
#         else:
#             new_col_value += '0,'
#     return new_col_value[:-1]

# Uncomment after running previous lines
# df['test_new'] = df['test'].map(get_one_hot_encoded_column)
# df

# Uncomment after running previous lines
# df2 = df['test_new'].apply(lambda x: pd.Series([i for i in x.lower().split(',')]))
# df2

# Uncomment after running previous lines
# df2.columns = chosen_columns_list
# df2

# Uncomment after running previous lines
# df2.info()

# Uncomment after running previous lines
# df2 = df2.apply(pd.to_numeric)

# Uncomment after running previous lines
# df2.info()

# Uncomment after running previous lines
# df_new = pd.concat([df, df2], axis=1)

# Uncomment after running previous lines
# df_new.drop(['test', 'test_new'], inplace=True, axis=1)
# df_new

Unnamed: 0,test,val
0,vikash|Arpit,6
1,vikash_1|arpit|Vinayp,3
2,arpit|vinayp,2
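
The step-by-step code in this example is useful for seeing how one hot encoding of a delimited column works. For comparison, pandas offers `str.get_dummies()`, which appears to cover the same task in a single call. This sketch is an alternative under that assumption, not the notebook's own method; note that it orders the new columns alphabetically rather than by length. `df_demo`, `encoded` and `df_encoded` are illustrative names.

```python
import pandas as pd

df_demo = pd.DataFrame({'test': ['vikash|Arpit', 'vikash_1|arpit|Vinayp', 'arpit|vinayp'],
                        'val': [6, 3, 2]})

# Split each lowercased string on '|' and create one 0/1 column per distinct value
encoded = df_demo['test'].str.lower().str.get_dummies(sep='|')

# Attach the encoded columns next to the remaining 'val' column
df_encoded = pd.concat([df_demo[['val']], encoded], axis=1)
print(df_encoded)
```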


### <font color="green">1.5 Challenges</font>

In [149]:
# Challenge 1
# Split the following dataframe into multiple rows
#
df = pd.DataFrame({
   'EmployeeId': ['123', '124', '125', '126', '126'],
   'City': ['Nairobi|Mombasa', 'Nakuru|Nairobi|Kisumu', 'Nairobi|Mombasa', 'Nairobi|Nakuru', 'Mombasa'] 
})

df['City'].apply(lambda x: pd.Series([i for i in x.split("|")]) )

Unnamed: 0,0,1,2
0,Nairobi,Mombasa,
1,Nakuru,Nairobi,Kisumu
2,Nairobi,Mombasa,
3,Nairobi,Nakuru,
4,Mombasa,,
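
The solution above spreads the cities across columns. If the goal is one row per employee and city, one possible approach is to split each string into a list and then call `explode()`, which repeats the other columns for every list element. The sketch below uses a shortened two-row version of the data; `df_demo` and `long_df` are illustrative names.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'EmployeeId': ['123', '124'],
    'City': ['Nairobi|Mombasa', 'Nakuru|Nairobi|Kisumu'],
})

# Turn each 'City' string into a list, then give every list element its own row
long_df = (
    df_demo.assign(City=df_demo['City'].str.split('|'))
           .explode('City')
           .reset_index(drop=True)
)
print(long_df)
```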


In [155]:
# Challenge 2
# Split the following dataframe 
# 

list_1 = [1,2,3,4,5]
list_2 = [2,4,6,8,10]
list_3 = ['one','two,three,four','three,four,five','four,five','five']

df = pd.DataFrame({'A' : list_1, 'B' : list_2, 'C' : list_3})

df['C'].apply(lambda x: pd.Series([i for i in x.split(",")]))

Unnamed: 0,0,1,2
0,one,,
1,two,three,four
2,three,four,five
3,four,five,
4,five,,
