# Week 6 - Data Wrangling

![data_wrangling.png](attachment:data_wrangling.png)

##  Table of Contents

- Theoretical Overview
- Problem Statement
- Code
    - Importing packages & libraries
    - Duplicated function
    - Map function
    - Replace function
    - Rename function
    - Describe function
    - GetDummies function
    - Quiz

## Theoretical Overview
- Most real-world data are dirty, so datasets must first be cleaned and converted before we can analyze them.
- Data wrangling refers to several procedures intended to convert unstructured data into formats that are easier to work with.
- Data wrangling transforms data from an unorganised or untidy source into something valuable.
- Data Wrangling consists of 6 steps:
    1. Discovery
    2. Structuring
    3. Cleaning
    4. Enriching
    5. Validating
    6. Publishing
- Data transformation is the technological process of translating data from one format, standard, or structure to another without affecting the content of the datasets.
- Data transformation may include:
    1. Constructive (adding, copying)
    2. Destructive (deleting fields and records)
    3. Structural (renaming, moving, and combining of columns)
- In this activity, we will briefly touch on the following functions: duplicated(), drop_duplicates(), map(), replace(), rename(), describe(), get_dummies()
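
The three kinds of data transformation listed above can be sketched in pandas as follows. This is only an illustration: the toy DataFrame and its column names ("first", "last", "tmp") are hypothetical and do not come from the slides.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the three transformation types.
df = pd.DataFrame({"first": ["Ada", "Alan"],
                   "last": ["Lovelace", "Turing"],
                   "tmp": [1, 2]})

# Constructive: add a new column derived from existing ones.
constructive = df.assign(full_name=df["first"] + " " + df["last"])

# Destructive: delete a field that is no longer needed.
destructive = constructive.drop(columns=["tmp"])

# Structural: rename a column without touching its values.
structural = destructive.rename(columns={"last": "surname"})

print(structural)
```

Each step returns a new DataFrame, so the original `df` stays unchanged throughout.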

## Problem Statement

This notebook is based on the theory and tutorial covered in slides 63 and 75 of Data Wrangling. In this activity, you will create multiple DataFrames and then apply various data cleaning techniques such as **drop_duplicates, map, replace, rename, and get_dummies**.

## Code

 ### Importing of required libraries:

In [1]:
import pandas as pd
import numpy as np

 ### Duplicated function:

Creating a dataset:

In [2]:
df_d = pd.DataFrame({"a":["one","two"]*3,
                    "b": [1,1,2,3,2,3]})

df_d

Unnamed: 0,a,b
0,one,1
1,two,1
2,one,2
3,two,3
4,one,2
5,two,3


The duplicated() function returns a boolean Series that is True for every row that repeats an earlier row:

In [3]:
df_d.duplicated()

0    False
1    False
2    False
3    False
4     True
5     True
dtype: bool

The drop_duplicates() function returns a new DataFrame with the duplicate records removed, keeping the first occurrence of each:

In [4]:
df_d.drop_duplicates()

Unnamed: 0,a,b
0,one,1
1,two,1
2,one,2
3,two,3
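
Both functions accept optional `keep` and `subset` parameters. The sketch below rebuilds the same DataFrame and shows, as an illustration rather than a required step of this activity, how those parameters change which rows count as duplicates.

```python
import pandas as pd

df_d = pd.DataFrame({"a": ["one", "two"] * 3,
                     "b": [1, 1, 2, 3, 2, 3]})

# keep="last" keeps the final occurrence of each repeated row instead of the first.
last_kept = df_d.drop_duplicates(keep="last")

# subset=["a"] compares only column "a", so one row per distinct value of "a" survives.
by_a = df_d.drop_duplicates(subset=["a"])

# keep=False flags every member of a duplicated group, including the first occurrence.
all_dupes = df_d.duplicated(keep=False)

print(last_kept)
print(by_a)
print(all_dupes)
```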


 ### Map function:

The following code creates a DataFrame:

In [5]:
df_m = pd.DataFrame({"names":["Olivia","Amelia","Isabelle","Mia","Ella"],
                    "scores":[50,32,67,32,21]})

df_m

Unnamed: 0,names,scores
0,Olivia,50
1,Amelia,32
2,Isabelle,67
3,Mia,32
4,Ella,21


We can transform the values in a column of a DataFrame with the function "map()".

So for example, the name "Olivia" will be mapped to "O" and the name "Amelia" will be mapped to "A".

The code below creates a dictionary whose keys are the names and whose values are the labels they will be mapped to:

In [6]:
classes = {"Olivia":"O","Amelia":"A","Isabelle":"I","Mia":"M","Ella":"E"}

 The following code will do the mapping:

In [7]:
df_m["Groupings"] = df_m["names"].map(classes)

df_m

Unnamed: 0,names,scores,Groupings
0,Olivia,50,O
1,Amelia,32,A
2,Isabelle,67,I
3,Mia,32,M
4,Ella,21,E
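
One detail worth knowing is what happens when a value has no entry in the dictionary. The sketch below uses a hypothetical name, "Zoe", that is missing from the mapping, and also shows that map() accepts a function in place of a dictionary.

```python
import pandas as pd

# "Zoe" is a hypothetical name with no entry in the dictionary.
names = pd.Series(["Olivia", "Amelia", "Zoe"])
classes = {"Olivia": "O", "Amelia": "A"}

# Values without a matching key become NaN.
mapped = names.map(classes)

# fillna() supplies a default label for the unmatched values.
filled = mapped.fillna("Unknown")

# map() also accepts a function; here it takes the first letter of each name.
initials = names.map(lambda s: s[0])

print(mapped)
print(filled)
print(initials)
```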


 ### Replace function:

The following code is the creation of a Series object:

In [8]:
df_r = pd.Series([67,21,79,39])

df_r

0    67
1    21
2    79
3    39
dtype: int64

We can replace values in a pandas Series or DataFrame using the function "replace()".

In its simplest form, "replace()" takes 2 arguments.

The first is the value you want to replace, and the second is the value you would like to replace it with.

The following is a breakdown of the function:

replace(value_to_be_replaced, value_to_replace_with)

 

 The following code is an example that will replace the value "67" with the value "0"

In [9]:
df_r.replace(67,0)

0     0
1    21
2    79
3    39
dtype: int64

The following code will replace multiple values with the replace() function. Each value in the first list is replaced by the value at the same position in the second list:

In [10]:
df_r.replace([21,79],[37,38])

0    67
1    37
2    38
3    39
dtype: int64
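
As an alternative to two parallel lists, replace() also accepts a dictionary of old-to-new pairs. The sketch below repeats the replacement above in that form and checks that the original Series is left untouched, since replace() returns a new object.

```python
import pandas as pd

df_r = pd.Series([67, 21, 79, 39])

# Dictionary form: each key is replaced by its value.
replaced = df_r.replace({21: 37, 79: 38})

print(replaced.tolist())  # the replaced copy
print(df_r.tolist())      # the original values are unchanged
```

To keep the result, assign it to a variable, as done with `replaced` here.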

 ### Rename function:

The following code is a creation of a DataFrame:

In [11]:
df_re = pd.DataFrame(np.arange(12).reshape(3,4), index=[0,1,2], columns=['sam', 'jeslyn', 'kish', 'dan'])

df_re

Unnamed: 0,sam,jeslyn,kish,dan
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11


We can rename the axis labels of a DataFrame with the help of the rename() function.

Passing a function such as str.upper to the "columns" argument applies it to every column name, so the following code will change the column names from lowercase to uppercase:

In [12]:
df_re.rename(columns = str.upper)

Unnamed: 0,SAM,JESLYN,KISH,DAN
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11


We can also change individual row or column names by passing a dictionary to the "rename()" function.

Now we will rename the index label 0 to "zero".

Code to rename the index label:

In [13]:
df_re.rename(index={0:"zero"})

Unnamed: 0,sam,jeslyn,kish,dan
zero,0,1,2,3
1,4,5,6,7
2,8,9,10,11


Now we will rename the column from "sam" to "spade"

Code to replace column:

In [14]:
df_re.rename(columns={"sam":"spade"})

Unnamed: 0,spade,jeslyn,kish,dan
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
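
Note that every output above still starts from the original labels: rename() returns a new DataFrame and leaves df_re unchanged. The sketch below, which rebuilds the same DataFrame, combines the index and column renames in one call and stores the result in a new variable (the name `renamed` is just an illustrative choice).

```python
import numpy as np
import pandas as pd

df_re = pd.DataFrame(np.arange(12).reshape(3, 4), index=[0, 1, 2],
                     columns=["sam", "jeslyn", "kish", "dan"])

# Rename one index label and one column in a single call.
renamed = df_re.rename(index={0: "zero"}, columns={"sam": "spade"})

print(renamed)
print(df_re.columns.tolist())  # the original DataFrame keeps its old labels
```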


 ### Describe function:

The following code generates a DataFrame: 

In [15]:
df_desc = pd.DataFrame(np.random.randn(2000,5))

df_desc

Unnamed: 0,0,1,2,3,4
0,0.440936,1.976199,-0.636206,-1.486093,-2.463065
1,0.232673,1.306515,0.086662,-2.281586,0.420182
2,0.250829,1.673432,0.571872,0.066447,0.808301
3,-1.993899,-0.634533,1.215493,0.366771,-0.249842
4,0.938580,-1.940846,-0.484084,0.119605,2.452507
...,...,...,...,...,...
1995,0.864136,0.639932,0.081496,1.309907,-1.159624
1996,-0.399918,-0.207696,-1.025268,0.536265,0.931253
1997,1.514912,-1.076328,0.346500,0.734496,-0.347668
1998,-0.910903,0.372820,0.564976,-0.524865,-1.418958


With the help of the describe() function, we can get summary statistics for each numeric column of the DataFrame.

In [16]:
df_desc.describe()

Unnamed: 0,0,1,2,3,4
count,2000.0,2000.0,2000.0,2000.0,2000.0
mean,0.033681,0.032669,-0.029145,0.008965,0.00526
std,1.007009,0.99358,0.998771,1.002515,0.998779
min,-2.938243,-3.219191,-3.22008,-3.432308,-4.419935
25%,-0.650524,-0.645993,-0.729852,-0.670728,-0.660022
50%,0.015962,0.063309,-0.044829,0.005806,0.034584
75%,0.699023,0.708509,0.642673,0.649322,0.690868
max,3.772666,3.767783,3.289346,3.229639,3.179816


Here we can see a breakdown of the following statistics, each computed separately for every column:

- count: the number of non-missing values in the column
- mean: the average of the values in the column
- std: the standard deviation, a measure of how widely the values are spread around the mean
- min: the minimum value in the column
- 25%: the 25th percentile, known as the first or lower quartile
- 50%: the 50th percentile, known as the median, which cuts the sorted data in half
- 75%: the 75th percentile, known as the third or upper quartile
- max: the maximum value in the column
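
To see that describe() works column by column, the sketch below uses a small hypothetical DataFrame (columns "x" and "y" are made up for illustration) in which one value is missing. The count for "y" is lower than for "x" because missing values are not counted.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column "y" contains one missing value.
small = pd.DataFrame({"x": [1, 2, 3, 4],
                      "y": [10, 20, 30, np.nan]})

summary = small.describe()

# Individual statistics can be read with .loc[row_label, column_label].
print(summary.loc["count", "x"], summary.loc["count", "y"])
print(summary.loc["mean", "x"], summary.loc["mean", "y"])
print(summary.loc["50%", "x"])
```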

### GetDummies function:

The following code generates a DataFrame: 

In [17]:
df_d = pd.DataFrame({"Letter":["a","b"]*3,
                    "Number": [0,1,2,3,4,5]})

df_d

Unnamed: 0,Letter,Number
0,a,0
1,b,1
2,a,2
3,b,3
4,a,4
5,b,5


The "get_dummies()" function converts a categorical variable into dummy/indicator variables: it creates one new column per category, and each row is marked in the column that matches its category.

You can read more about this function from the pandas library documentation page:
https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html

The following code creates dummy variables from the "Letter" column:

In [18]:
pd.get_dummies(df_d["Letter"])

Unnamed: 0,a,b
0,1,0
1,0,1
2,1,0
3,0,1
4,1,0
5,0,1


Column "a" holds 1 in every row where "Letter" is "a", and 0 otherwise.

Column "b" holds 1 in every row where "Letter" is "b", and 0 otherwise.
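
Since pandas 2.0, get_dummies() returns True/False columns by default, so the output may differ from the 1/0 values shown above depending on the installed version. The sketch below rebuilds the same DataFrame, requests integers with the `dtype` parameter, and uses `prefix` to give the new columns clearer names. Joining the result back to the "Number" column is an illustrative extra step, not part of the original activity.

```python
import pandas as pd

df_d = pd.DataFrame({"Letter": ["a", "b"] * 3,
                     "Number": [0, 1, 2, 3, 4, 5]})

# dtype=int requests 0/1 integers; prefix="Letter" names the columns Letter_a and Letter_b.
dummies = pd.get_dummies(df_d["Letter"], prefix="Letter", dtype=int)

# Attach the indicator columns to the remaining data.
encoded = df_d[["Number"]].join(dummies)

print(encoded)
```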



 ## Quiz time!

 #### Question 1:

 The code to create the DataFrame is given below. Run it, then proceed with the questions that follow.

In [19]:
q1 = pd.DataFrame({"a":["Four","Five"]*3,
                    "b": [4,5,4,5,4,5]})

q1

Unnamed: 0,a,b
0,Four,4
1,Five,5
2,Four,4
3,Five,5
4,Four,4
5,Five,5


 What is the code used to Check for duplicates in the DataFrame stored in the variable "q1"? (Created above)
 
 Please type the code below to CHECK for duplicates:

In [20]:
#Code


 What is the code used to Drop duplicates in the DataFrame stored in the variable "q1"? (Created above)
 
 Please type the code below to DROP duplicates:

In [21]:
#Code


 #### Question 2:

 The code to create the DataFrame is given below. Run it, then proceed with the questions that follow.

In [22]:
q2 = pd.DataFrame({"names":["Kelly","Oliver","Kenneth","Bill","Darren"],
                    "scores":[50,32,67,32,21]})

q2

Unnamed: 0,names,scores
0,Kelly,50
1,Oliver,32
2,Kenneth,67
3,Bill,32
4,Darren,21


 What is the code used to MAP values in a DataFrame?
 
 Please type the code below to MAP values for the following:
 
 - Kelly will be mapped as "K"
 - Oliver will be mapped as "O"
 - Kenneth will be mapped as "K"
 - Bill will be mapped as "B"
 - Darren will be mapped as "D"
 


In [23]:
# Code
classes = {"":"","":""}

Type the code to do the mapping below:

In [24]:
# Code

 #### Question 3:

The code to create the Series object is given below. Run it, then proceed with the questions that follow.

In [25]:
q3 = pd.Series([93,23,37,99])

q3

0    93
1    23
2    37
3    99
dtype: int64

 What is the code used to REPLACE values in a Series?
 
 Please type the code below to REPLACE values for the following:
 - Value "93" is to be replaced with "21"
 - Value "23" is to be replaced with "22"
 - Value "37" is to be replaced with "23"
 - Value "99" is to be replaced with "24"

Type the code below to replace Value "93" with "21":


In [26]:
# Code

Type the code below to replace Value "23" with "22":


In [27]:
# Code

Type the code below to replace Value "37" with "23":


In [28]:
# Code

Type the code below to replace Value "99" with "24":


In [29]:
# Code

 #### Question 4:

The code to create the DataFrame is given below. Run it, then proceed with the questions that follow.

In [30]:
q4 = pd.DataFrame(np.arange(12).reshape(3,4), index=[0,1,2], columns=['calvin', 'jorddie', 'dom', 'allen'])

q4

Unnamed: 0,calvin,jorddie,dom,allen
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11


What is the function that is used to change the column names from lowercase to uppercase?

Key in the code to change the column names from lowercase to UPPERCASE:

In [31]:
# Code

We can also change the row or column names using the "rename()" function.

Use the function to rename the following index label:
- Index "2" to be changed to "Two"

Key in the code to rename the index label:

In [32]:
# Code

Now rename the column "calvin" to "bavier".

Key in the code to rename the column:

In [33]:
# Code

 #### Question 5:

The code to create the DataFrame is given below. Run it, then proceed with the questions that follow.

In [34]:
q5 = pd.DataFrame(np.random.randn(2000,5))

q5

Unnamed: 0,0,1,2,3,4
0,0.525176,-0.600748,1.665151,0.085413,0.020242
1,-0.080047,-1.301101,1.755974,0.645166,1.594054
2,-2.336770,-0.564295,-0.614756,-0.444586,-0.700428
3,-1.506837,1.260457,-1.684829,-1.189636,-1.313780
4,-0.455805,0.508072,0.978758,0.098174,1.862498
...,...,...,...,...,...
1995,0.277960,-0.127093,-0.837742,0.043345,0.474758
1996,-1.364586,-0.236346,-0.879023,-0.165789,1.189134
1997,-0.804029,0.807003,1.477181,-0.282601,0.543034
1998,0.864604,1.484600,0.144342,-1.867263,2.162690


What is the function used to find specific values/statistical summaries in a DataFrame?

Key in the code to get the summary statistics of the DataFrame:

In [35]:
# Code

##### Name 2 of the summary statistics and explain what they mean:

Statistic summary 1:

In [36]:
# Key in answer here

Statistic summary 2:

In [37]:
# Key in answer here

 #### Question 6:

 The code to create the DataFrame is given below. Run it, then proceed with the questions that follow.

In [38]:
q6 = pd.DataFrame({"Letter":["c","d"]*3,
                    "Number": [0,1,2,3,4,5]})

q6

Unnamed: 0,Letter,Number
0,c,0
1,d,1
2,c,2
3,d,3
4,c,4
5,d,5


 What is the function used to create dummy variables?

 Create dummy variables from the column "Letter".

Key in the code used to create the dummy variables:

In [39]:
# Code

 # Congratulations on completing this activity!