## CHAPTER 1 PRACTICE SET
---
# DATA WRANGLING

---

Reference: **Machine Learning with Python Cookbook** by Chris Albon, *Chapter 3*

### 1.1 Loading the Data

Import the Pandas library

Load the csv file "titanic" in the "data" folder and assign it to "titanic" (dataframe)

Show the first 4 rows of "titanic"

Show the last 4 rows of "titanic"

### 1.2 Creating a DataFrame

Create a new dataframe and name it "df_scratch". Show "df_scratch"

To "df_scratch":
- Add column 'Name' containing 'Elizabeth Walton', 'Helen Loraine', 'Hudson Creighton', and 'Hudson JC'
- Add column 'Age' containing 29, 2, 30, and 25
- Add column 'Female' containing True, True, False, and True
- Add column 'Survived' containing 1, 0, 0, and 0
- Show "df_scratch"

- Create a series named 'new_row' containing 'Rene Lievens', 24, False, and 0
- Add 'new_row' at the end of "df_scratch"
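
The steps above can be sketched as follows. This is one possible approach, not the only one: it uses `pd.concat` to add the row because `DataFrame.append` was removed in pandas 2.0.

```python
import pandas as pd

# Start from an empty DataFrame and add one column at a time
df_scratch = pd.DataFrame()
df_scratch['Name'] = ['Elizabeth Walton', 'Helen Loraine',
                      'Hudson Creighton', 'Hudson JC']
df_scratch['Age'] = [29, 2, 30, 25]
df_scratch['Female'] = [True, True, False, True]
df_scratch['Survived'] = [1, 0, 0, 0]

# Build the new row as a Series whose index matches the column names
new_row = pd.Series(['Rene Lievens', 24, False, 0], index=df_scratch.columns)

# Turn the Series into a one-row frame and concatenate it at the end
df_scratch = pd.concat([df_scratch, new_row.to_frame().T], ignore_index=True)
print(df_scratch)
```

Because the one-row frame built from a mixed-type Series holds generic objects, the column dtypes may become `object` after the concatenation.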

### 1.3 Describing the Data

Show the first 3 rows of "titanic"

Get the number of rows and columns

Get descriptive statistics for numeric columns

### 1.4 Navigating DataFrames

- Select only the 10th row of *titanic*
- Select rows 14-18 all inclusive
- Select the first 3 rows without using *head()*
- Select the last row
- Set index to "Name" and name the new dataframe *titanic_a*
- In *titanic_a*, find "Zenni, Mr Philip"

**Note:** To select individual rows and slices of rows, pandas provides two methods:
* `loc` is useful when the index of the DataFrame is a label (a string)
* `iloc` works by position in the DataFrame. For example, `iloc[0]` returns the first row regardless of whether the index is an integer or a label.
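
To make the difference concrete, here is a minimal sketch on a hypothetical three-row frame (the `people` frame and its ages are made up for illustration and are not the real titanic data):

```python
import pandas as pd

# Hypothetical stand-in for a few titanic rows (ages are illustrative)
people = pd.DataFrame({
    'Name': ['Allen, Miss Elisabeth Walton',
             'Allison, Miss Helen Loraine',
             'Zenni, Mr Philip'],
    'Age': [29.0, 2.0, 22.0],
})
by_name = people.set_index('Name')

# iloc selects by position, whatever the index looks like
first_by_position = by_name.iloc[0]
first_two = by_name.iloc[0:2]            # positions 0 and 1, end excluded

# loc selects by index label
by_label = by_name.loc['Zenni, Mr Philip']
# Label slices include both endpoints
label_slice = by_name.loc['Allison, Miss Helen Loraine':'Zenni, Mr Philip']

print(first_by_position, first_two, by_label, label_slice, sep='\n\n')
```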

### 1.5 Selecting Rows on Conditionals

From "titanic", show the first 3 rows where column 'Sex' is 'female'

Show the last 3 rows where column 'Sex' is 'male'

Select all of the rows where passenger is both female and 63 or older
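
Conditional selection is usually done with boolean masks. The sketch below uses a small hypothetical frame named `sample` in place of "titanic"; the column names follow the exercise, the values are invented.

```python
import pandas as pd

# Hypothetical data with the same column names as the exercise
sample = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Sex': ['female', 'male', 'female', 'male'],
    'Age': [63.0, 70.0, 30.0, 25.0],
})

# A comparison yields a boolean Series that filters the rows
females = sample[sample['Sex'] == 'female']

# Combine conditions with & and wrap each one in parentheses
older_females = sample[(sample['Sex'] == 'female') & (sample['Age'] >= 63)]

print(females.head(3))
print(older_females)
```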

### 1.6 Replacing Values

Replace "female" with "Woman" in the "Sex" column and show the first 3 rows

Replace "female" with "Woman" and "male" with "Man" in the "Sex" column and show the last 3 rows

Replace 1 with "one" and 0 with "zero" and show rows 53-55 all inclusive

Replace "1st" with "First" using regular expressions and show first 3 rows
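
A minimal sketch of the string replacements, assuming a small hypothetical frame named `sample` instead of the real data. `replace` returns a new object and leaves the original untouched.

```python
import pandas as pd

# Hypothetical data with the same column names as the exercise
sample = pd.DataFrame({
    'PClass': ['1st', '2nd', '1st'],
    'Sex': ['female', 'male', 'female'],
    'Survived': [1, 0, 1],
})

# Replace a single value in one column
one_value = sample['Sex'].replace('female', 'Woman')

# Replace several values at once using two parallel lists
two_values = sample['Sex'].replace(['female', 'male'], ['Woman', 'Man'])

# Treat the pattern as a regular expression across the whole frame
with_regex = sample.replace(r'1st', 'First', regex=True)

print(one_value.head(3), two_values.tail(3), with_regex.head(3), sep='\n\n')
```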

### 1.7 Renaming Columns

Rename the "PClass" column "Pass_Class" and show first 3 rows

Rename column "Sex" to "Gender" and "Survived" to "Survivor" and show first 3 rows
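
Renaming can be sketched with `rename` and a dictionary that maps old names to new ones. The frame `sample` below is a hypothetical stand-in for "titanic".

```python
import pandas as pd

# Hypothetical data with the same column names as the exercise
sample = pd.DataFrame({
    'PClass': ['1st', '2nd', '1st'],
    'Sex': ['female', 'male', 'female'],
    'Survived': [1, 0, 1],
})

# Rename one column; the original frame is not modified
renamed_one = sample.rename(columns={'PClass': 'Pass_Class'})

# Rename two columns in a single call
renamed_two = sample.rename(columns={'Sex': 'Gender', 'Survived': 'Survivor'})

print(renamed_one.head(3))
print(renamed_two.head(3))
```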

### 1.8 Finding Basic Statistics

Use the format function to print the following statistics for the "Age" column:
- Maximum
- Minimum
- Mean
- Sum
- Standard Error of the Mean
- Mode
- Median

Use the format function to print the following statistics for the entire dataframe:
- Variance
- Standard Deviation
- Kurtosis
- Skewness

Show the count for the entire dataframe
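
One way to combine `str.format` with the pandas statistics methods is sketched below on a hypothetical five-row frame. Passing `numeric_only=True` is an assumption made here so that text columns in the real data would be skipped.

```python
import pandas as pd

# Hypothetical numeric data standing in for "titanic"
sample = pd.DataFrame({
    'Age': [29.0, 2.0, 30.0, 25.0, 30.0],
    'Survived': [1, 0, 0, 0, 1],
})
ages = sample['Age']

# Statistics for a single column
print('Maximum: {}'.format(ages.max()))
print('Minimum: {}'.format(ages.min()))
print('Mean: {}'.format(ages.mean()))
print('Sum: {}'.format(ages.sum()))
print('Standard error of the mean: {}'.format(ages.sem()))
print('Mode: {}'.format(ages.mode().tolist()))   # mode() can return several values
print('Median: {}'.format(ages.median()))

# Statistics for every numeric column of the frame
print('Variance:\n{}'.format(sample.var(numeric_only=True)))
print('Standard deviation:\n{}'.format(sample.std(numeric_only=True)))
print('Kurtosis:\n{}'.format(sample.kurt(numeric_only=True)))
print('Skewness:\n{}'.format(sample.skew(numeric_only=True)))

# Number of non-missing values per column
print(sample.count())
```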

### 1.9 Finding Unique Values

Show unique values in the "Sex" column

Show value counts in the "Sex" column

Show value counts in the "PClass" column

Show the number of unique values in the "PClass" column
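
The three calls involved here can be sketched on a hypothetical Series named `sex` (made-up values):

```python
import pandas as pd

# Hypothetical values standing in for the "Sex" column
sex = pd.Series(['female', 'male', 'male', 'female', 'male'])

unique_values = sex.unique()      # distinct values in order of first appearance
counts = sex.value_counts()       # how often each value occurs
n_unique = sex.nunique()          # number of distinct values

print(unique_values)
print(counts)
print(n_unique)
```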

### 1.10 Handling Missing Values

Select missing values in "Age" column and show first 2 rows

- Import NumPy
- Replace "male" in the "Sex" column with NaN (`np.nan`) and show first 3 rows
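
A minimal sketch of both tasks, assuming a hypothetical frame `sample` with one missing age. The missing-value marker is the float `np.nan`, not the string "NaN".

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing age
sample = pd.DataFrame({
    'Sex': ['female', 'male', 'female'],
    'Age': [29.0, np.nan, 2.0],
})

# isnull() flags the missing entries; use the mask to select those rows
missing_age = sample[sample['Age'].isnull()]

# Replace a value with NaN (returns a new Series)
sex_with_nan = sample['Sex'].replace('male', np.nan)

print(missing_age.head(2))
print(sex_with_nan.head(3))
```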

### 1.11 Deleting a Column

Drop the "Age" column and show 3 rows

Drop the "Survived" and "SexCode" columns and show 2 rows

Show columns in "titanic"

- Create a new dataframe "titanic_b" that is *titanic* without the second column
- Show the first 3 rows of "titanic_b"
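
Dropping columns can be sketched with `drop` and `axis=1`. The frame `sample` below is hypothetical; its column names are assumed to match the ones mentioned in the exercise.

```python
import pandas as pd

# Hypothetical data; the column names are assumed from the exercise text
sample = pd.DataFrame({
    'Name': ['A', 'B', 'C'],
    'PClass': ['1st', '2nd', '1st'],
    'Age': [29.0, 2.0, 30.0],
    'Sex': ['female', 'female', 'male'],
    'Survived': [1, 0, 0],
    'SexCode': [1, 1, 0],
})

# Drop one column by name
no_age = sample.drop('Age', axis=1)

# Drop several columns by passing a list
no_two = sample.drop(['Survived', 'SexCode'], axis=1)

# Drop a column by position: columns[1] is the second column
titanic_b = sample.drop(sample.columns[1], axis=1)

print(list(sample.columns))
print(titanic_b.head(3))
```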

### 1.12 Deleting a Row

In "titanic", drop the first two rows and show the first 3 rows

Delete all rows where "Sex" is "male"

Delete the row where the "Name" is "Allison, Miss Helen Loraine"

Delete the row with index 2
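
Rows are often "deleted" by keeping everything that does not match a condition. The sketch below takes that approach on a hypothetical four-row frame; `DataFrame.drop` with index labels would be an alternative.

```python
import pandas as pd

# Hypothetical data standing in for "titanic"
sample = pd.DataFrame({
    'Name': ['Allen, Miss Elisabeth Walton',
             'Allison, Miss Helen Loraine',
             'Allison, Mr Hudson Joshua Creighton',
             'Zenni, Mr Philip'],
    'Sex': ['female', 'female', 'male', 'male'],
})

# Skip the first two rows by position
without_first_two = sample.iloc[2:]

# Keep only the rows that do NOT match the condition
no_males = sample[sample['Sex'] != 'male']
no_helen = sample[sample['Name'] != 'Allison, Miss Helen Loraine']
no_index_2 = sample[sample.index != 2]

print(without_first_two.head(3))
print(no_males, no_helen, no_index_2, sep='\n\n')
```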

### 1.13 Dropping Duplicate Rows

- Drop duplicates in "titanic" and assign it to a new dataframe named "titanic_c"
- Print the number of rows in "titanic" and "titanic_c" side by side

- Drop duplicates in "Age" column of "titanic" and assign it to a new dataframe "titanic_c"
- Print the number of rows in "titanic" and "titanic_c" side by side

Show the first two rows of "titanic_c"

- Drop duplicates in "Age" column of "titanic", while keeping 'last', and assign it to a new dataframe "titanic_d".
- Show the first two rows of "titanic_d"

In "titanic_d", check whether a row is duplicated or not in the "Sex" column and show the first 4 rows
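
The effect of `subset` and `keep` is easiest to see on a tiny frame. The sketch below uses hypothetical data in which row 3 repeats row 0 exactly.

```python
import pandas as pd

# Hypothetical data: row 3 is an exact copy of row 0
sample = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'A'],
    'Sex': ['female', 'male', 'female', 'female'],
    'Age': [29.0, 29.0, 2.0, 29.0],
})

# Without arguments, only rows identical in every column are dropped
deduped = sample.drop_duplicates()
print(len(sample), len(deduped))

# With subset, only the listed columns are compared (first occurrence kept)
by_age_first = sample.drop_duplicates(subset=['Age'])

# keep='last' keeps the last occurrence instead
by_age_last = sample.drop_duplicates(subset=['Age'], keep='last')

# duplicated() returns a boolean flag per row instead of dropping anything
dup_flags = by_age_last.duplicated(subset=['Sex'])
print(by_age_first, by_age_last, dup_flags, sep='\n\n')
```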

### 1.14 Grouping Rows by Values

In "titanic", group by "Sex" and count the number of people in each category

Group by "Sex" and count the number of "Name" in each category

Group by "Sex, PClass, Survived" and count the number of values in each category

Group by "Sex, PClass, Survived" and find the mean "Age" in each category
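
A grouping only produces a result once an aggregation is applied to it. The sketch below shows the pattern on a hypothetical five-row frame with the same column names.

```python
import pandas as pd

# Hypothetical data with the same column names as the exercise
sample = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D', 'E'],
    'PClass': ['1st', '1st', '2nd', '1st', '2nd'],
    'Sex': ['female', 'male', 'female', 'female', 'male'],
    'Survived': [1, 0, 1, 1, 0],
    'Age': [29.0, 40.0, 2.0, 31.0, 20.0],
})

# Count non-missing values of every column per group
counts = sample.groupby('Sex').count()

# Count a single column per group
name_counts = sample.groupby('Sex')['Name'].count()

# Group by several columns by passing a list
keys = ['Sex', 'PClass', 'Survived']
multi_counts = sample.groupby(keys)['Name'].count()
mean_age = sample.groupby(keys)['Age'].mean()

print(counts, name_counts, multi_counts, mean_age, sep='\n\n')
```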

### 1.15 Grouping Rows by Time

Our *titanic* dataset has no datetime column so let's create a new dataframe for practice purposes only. We need Pandas and NumPy libraries for this, which are already loaded:

Create a date range (named *time_index*) starting from 06/06/2017 with 100,000 periods that are 30 seconds apart from each other

Create a dataframe named *time_df* and set its index to *time_index*. Show the first 3 rows

Create a new column "Sales_Amount" containing 100,000 random integers between 1 and 10. Show first 3 rows

Group rows by week and calculate the sum per week

Group rows by 3 weeks and calculate the mean for each 3-week period

Group rows by month and show the number of sales per month

Group rows by month and show the number of sales per month but label dates on left
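
The time-based grouping can be sketched with `resample`. Several choices below are assumptions: the random generator and its seed, treating 10 as an included upper bound, and demonstrating `label='left'` on weekly bins. Weekly bins are used because the month-end alias changed name across pandas versions ('M' in older releases, 'ME' from 2.2), so the monthly count here groups on calendar-month periods instead.

```python
import numpy as np
import pandas as pd

# 100,000 timestamps, 30 seconds apart
time_index = pd.date_range('06/06/2017', periods=100000,
                           freq=pd.Timedelta(seconds=30))
time_df = pd.DataFrame(index=time_index)

# Random integers from 1 to 10 (upper bound 11 is exclusive); seed is arbitrary
rng = np.random.default_rng(0)
time_df['Sales_Amount'] = rng.integers(1, 11, size=len(time_index))

weekly_sum = time_df.resample('W').sum()           # bins labelled by their end date
three_week_mean = time_df.resample('3W').mean()    # one mean per 3-week bin
weekly_left = time_df.resample('W', label='left').sum()  # same bins, start-date labels

# Number of sales per calendar month, grouped on monthly periods
monthly_count = time_df.groupby(time_df.index.to_period('M')).count()

print(weekly_sum, three_week_mean, weekly_left, monthly_count, sep='\n\n')
```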

### 1.16 Looping Over a Column

Back to our *titanic* dataframe: Show the first 3 rows of "Name" column only

Use a *for* loop to print the 3 names in uppercase

Use list comprehension to print the 3 names in uppercase
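
Both looping styles can be sketched on a hypothetical Series of names:

```python
import pandas as pd

# Hypothetical names standing in for the "Name" column
names = pd.Series(['Allen, Miss Elisabeth Walton',
                   'Allison, Miss Helen Loraine',
                   'Zenni, Mr Philip'])

# A pandas Series can be iterated like any Python sequence
for name in names.iloc[:3]:
    print(name.upper())

# The same transformation as a list comprehension
upper_names = [name.upper() for name in names.iloc[:3]]
print(upper_names)
```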

### 1.17 Applying a Function to Columns and Groups

Create a function "uppercase" that takes in one argument x and returns x in uppercase. Try it on the string 'test'

Apply the "uppercase" function on the entire "Name" column and show the first 4 rows

Group rows by "Sex", then "PClass", and "Survived" and apply a lambda function to count on "Name" column
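
A minimal sketch of `apply` on a column and on groups, using hypothetical data. Applying the lambda to the selected "Name" column of each group is one reading of the last task.

```python
import pandas as pd

def uppercase(x):
    """Return the string x in uppercase."""
    return x.upper()

print(uppercase('test'))

# Hypothetical data with the same column names as the exercise
sample = pd.DataFrame({
    'Name': ['a', 'b', 'c', 'd'],
    'PClass': ['1st', '1st', '2nd', '1st'],
    'Sex': ['female', 'male', 'female', 'female'],
    'Survived': [1, 0, 1, 1],
})

# apply() calls the function once per element of the column
upper_names = sample['Name'].apply(uppercase)

# On a grouped column, the lambda receives one Series per group
group_counts = (sample.groupby(['Sex', 'PClass', 'Survived'])['Name']
                      .apply(lambda s: s.count()))

print(upper_names.head(4))
print(group_counts)
```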

### 1.18 Concatenating DataFrames

- Create a dictionary "dict_a" with the following keys and corresponding values:
    1. key is 'id', values are '1', '2', '3'
    2. key is 'first', values are 'Alex', 'Amy', 'Allen'
    3. key is 'last', values are 'Anderson', 'Ackerman', 'Ali'
- Create a list "cols" whose values are the keys to "dict_a"
- Create a new dataframe "df_a" whose values are contents of "dict_a" and columns are contents of "cols"
- Show "df_a"

- Create a dictionary "dict_b" with the following keys and corresponding values:
    1. key is 'id', values are '4', '5', '6'
    2. key is 'first', values are 'Billy', 'Brian', 'Bran'
    3. key is 'last', values are 'Bonder', 'Black', 'Balwner'
- Create a new dataframe "df_b" whose values are contents of "dict_b" and columns are contents of "cols"
- Show "df_b"

Concatenate "df_a" and "df_b" by rows

Concatenate "df_a" and "df_b" by rows but ignore index

Concatenate "df_a" and "df_b" by columns

Concatenate "df_a" and "df_b" by columns but ignore index

- Create a series named "row_a" containing *10, "Chris", "Chillon"* whose index is the contents of "cols"
- Append "row_a" to "df_a" and ignore index
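
The concatenation tasks can be sketched with `pd.concat`, where `axis` picks the direction and `ignore_index` renumbers the labels along that axis. Adding the row also uses `pd.concat`, since `DataFrame.append` was removed in pandas 2.0.

```python
import pandas as pd

dict_a = {'id': ['1', '2', '3'],
          'first': ['Alex', 'Amy', 'Allen'],
          'last': ['Anderson', 'Ackerman', 'Ali']}
cols = list(dict_a.keys())
df_a = pd.DataFrame(dict_a, columns=cols)

dict_b = {'id': ['4', '5', '6'],
          'first': ['Billy', 'Brian', 'Bran'],
          'last': ['Bonder', 'Black', 'Balwner']}
df_b = pd.DataFrame(dict_b, columns=cols)

# Stack the rows; the original index labels 0..2 appear twice
by_rows = pd.concat([df_a, df_b], axis=0)
# ignore_index=True renumbers the rows 0..5
by_rows_reset = pd.concat([df_a, df_b], axis=0, ignore_index=True)

# Place the frames side by side; column names repeat
by_cols = pd.concat([df_a, df_b], axis=1)
# Along axis=1, ignore_index=True renumbers the columns 0..5
by_cols_reset = pd.concat([df_a, df_b], axis=1, ignore_index=True)

# Add a Series as a new row
row_a = pd.Series([10, 'Chris', 'Chillon'], index=cols)
df_a_plus = pd.concat([df_a, row_a.to_frame().T], ignore_index=True)

print(by_rows, by_rows_reset, by_cols, by_cols_reset, df_a_plus, sep='\n\n')
```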

### 1.19 Merging DataFrames

- Create a dictionary "dict_c" with the following keys and corresponding values:
    1. key is 'employee_id', values are '1', '2', '3', '4'
    2. key is 'name', values are 'Amy Jones', 'Allen Keys', 'Alice Bees', 'Tim Horton'
- Create a new dataframe "df_c" whose values are contents of "dict_c" and columns are 'employee_id' and 'name'
- Show "df_c"

- Create a dictionary "dict_d" with the following keys and corresponding values:
    1. key is 'employee_id', values are '3', '4', '5', '6'
    2. key is 'total_sales', values are 23456, 2512, 2345, 1455
- Create a new dataframe "df_d" whose values are contents of "dict_d" and columns are 'employee_id' and 'total_sales'
- Show "df_d"

Merge "df_c" with "df_d" on 'employee_id'

Do an outer merge of "df_c" and "df_d" on 'employee_id'

Do a left merge of "df_c" and "df_d" on 'employee_id'

Merge "df_c" with "df_d" left on 'employee_id' and right on 'employee_id'

Merge "df_c" with "df_d" with left index and right index true
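
The merge variants differ mainly in which keys survive. A sketch using the two frames described above:

```python
import pandas as pd

dict_c = {'employee_id': ['1', '2', '3', '4'],
          'name': ['Amy Jones', 'Allen Keys', 'Alice Bees', 'Tim Horton']}
df_c = pd.DataFrame(dict_c, columns=['employee_id', 'name'])

dict_d = {'employee_id': ['3', '4', '5', '6'],
          'total_sales': [23456, 2512, 2345, 1455]}
df_d = pd.DataFrame(dict_d, columns=['employee_id', 'total_sales'])

# Default (inner) merge keeps only the ids present in both frames
inner = pd.merge(df_c, df_d, on='employee_id')

# Outer merge keeps every id and fills the gaps with NaN
outer = pd.merge(df_c, df_d, on='employee_id', how='outer')

# Left merge keeps every row of the left frame
left = pd.merge(df_c, df_d, on='employee_id', how='left')

# Key columns can be named separately for each side
left_right = pd.merge(df_c, df_d, left_on='employee_id', right_on='employee_id')

# Merge on the row index instead of a column; clashing names get _x / _y suffixes
by_index = pd.merge(df_c, df_d, left_index=True, right_index=True)

print(inner, outer, left, left_right, by_index, sep='\n\n')
```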