<a href="https://colab.research.google.com/github/rg326/data_science/blob/main/COOP_notebooks/LM_Copy_of_Lesson_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="https://tinyurl.com/k2t79s6t" style="float: left; margin: 20px; height: 55px">
 
# Basic Exploratory Data Analysis using Pandas

_Author: Christopher Chan_

### Objective

Upon completion of this lesson you should be able to understand the following:

1. Pandas library
2. Dataframes
3. Data selection
4. Data manipulation
5. Handling of missing data

This stage, often referred to as "cleaning the data", is arguably the most important part of an analysis. Data must be usable for the analysis to be valid; otherwise it is garbage in, garbage out.

##### ==================================================================================================
## Data Selection and Inspection


### Pandas Library

`pandas` is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
built on top of the Python programming language.

A `pandas` dataframe can be created by loading data from existing external storage such as a SQL database or a CSV file. It can also be created from Python lists, dictionaries, and similar objects. For simplicity, we will use `.csv` files. One way to create a pandas dataframe is shown below:

### DataFrames
A data frame is a structured representation of data.
##### ==================================================================================================

In [1]:
import pandas as pd

In [2]:
data = {'Name':['John', 'Tiffany', 'Chris', 'Winnie', 'David'],
        'Age': [24, 23, 22, 19, 10], 
        'Salary': [60000,120000,1000000,75000,80000]}

people_df = pd.DataFrame(data)
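
A dictionary of lists is only one of the options mentioned earlier. As a minimal sketch, the same kind of table can also be built from a list of row dictionaries or from a list of lists with explicit column names. The variable names `rows`, `from_records`, and `from_lists` are illustrative and not part of the lesson's datasets.

```python
import pandas as pd

# Each dictionary is one row; the keys become column names.
rows = [
    {"Name": "John", "Age": 24, "Salary": 60000},
    {"Name": "Tiffany", "Age": 23, "Salary": 120000},
]
from_records = pd.DataFrame(rows)

# Each inner list is one row; column names are supplied separately.
from_lists = pd.DataFrame(
    [["John", 24, 60000], ["Tiffany", 23, 120000]],
    columns=["Name", "Age", "Salary"],
)

# Both approaches produce the same dataframe.
print(from_records.equals(from_lists))
```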

##### ==================================================================================================
We can inspect the dataframe we named `people_df` with the `.head()` method, which displays the first five rows by default. Similarly, the `.tail()` method returns the last five rows of a dataframe.

In [3]:
people_df.head()

Unnamed: 0,Name,Age,Salary
0,John,24,60000
1,Tiffany,23,120000
2,Chris,22,1000000
3,Winnie,19,75000
4,David,10,80000


##### ==================================================================================================
We can also change the number of rows displayed by passing an integer to the `.head()` method.

Example: Select the first 2 rows of the dataframe

In [4]:
people_df.head(2)

Unnamed: 0,Name,Age,Salary
0,John,24,60000
1,Tiffany,23,120000


Example: Select the last 2 rows of the dataframe

In [5]:
people_df.tail(2)

Unnamed: 0,Name,Age,Salary
3,Winnie,19,75000
4,David,10,80000


##### ==================================================================================================
Another way to create a dataframe is to load an existing CSV file by passing its file path to the `read_csv` function built into `pandas`, as shown below:

`dataframe = pd.read_csv(".../file_location/file_name.csv")`

In [6]:
movies_df = pd.read_csv("/content/Pixar_Movies.csv")


##### ==================================================================================================

In [None]:
movies_df.head(10)

#### The above python code is equivalent to SQL's

```sql
SELECT * 
FROM Movies
LIMIT 10
```
##### ==================================================================================================

`.shape` shows the number of rows and columns

In [None]:
movies_df.shape

This shows how many rows and columns are in the entire dataframe: 14 rows and 5 columns.
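
Since `.shape` is an attribute (no parentheses) holding a `(rows, columns)` tuple, it can be unpacked into two variables. The sketch below uses a small made-up dataframe, `demo_df`, rather than the Pixar file.

```python
import pandas as pd

# Hypothetical data for illustration only.
demo_df = pd.DataFrame({"Title": ["A", "B", "C"], "Year": [1995, 1998, 2001]})

n_rows, n_cols = demo_df.shape  # shape is a tuple: (rows, columns)
print(n_rows, n_cols)
print(len(demo_df))  # len() gives the number of rows only
```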

##### ==================================================================================================

`.dtypes` shows the data types

In [None]:
movies_df.dtypes

`.describe()` can be used to help summarize numerical data in our dataframe. It summarizes the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.

In [None]:
movies_df.describe()

You may optionally include categorical data in the `describe` method like so:

In [None]:
movies_df.describe(include='all')

In [None]:
movies_df.info()

##### ==================================================================================================

### Row and Column Selection

There are two common ways to select rows and columns in a dataframe: `.loc` and `.iloc`.

`.loc` selects rows and columns by label/name

`.iloc` selects rows and columns by integer position

Example: using `.loc` to select every row in the dataframe by using `:` and filtering the columns to just Title, Director and Year

In [None]:
movies_df.loc[:, ['Title','Director','Year']]

##### ==================================================================================================

Similarly, we obtain the same result using `.iloc` by selecting the columns at positions 1, 2, and 3, which correspond to Title, Director and Year respectively, as shown below:

In [None]:
movies_df.iloc[ :, [1,2,3] ]

#### The two python codes above are equivalent to SQL's

```sql
SELECT Title, Director, Year
FROM Movies
```
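
The difference between label-based and position-based selection is easiest to see when the index labels are not 0, 1, 2, and so on. The following sketch uses a small made-up dataframe (`demo_df`, with hypothetical titles and index labels) rather than the Pixar file. Note that a `.loc` slice includes its end label, while an `.iloc` slice excludes its end position.

```python
import pandas as pd

demo_df = pd.DataFrame(
    {"Title": ["A", "B", "C", "D"], "Year": [1995, 1998, 2001, 2003]},
    index=[10, 20, 30, 40],  # labels deliberately differ from positions
)

# .loc uses index labels; the end label (30) is included.
by_label = demo_df.loc[10:30, ["Title", "Year"]]

# .iloc uses integer positions; the end position (2) is excluded.
by_position = demo_df.iloc[0:2, [0, 1]]

print(by_label)
print(by_position)
```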

##### ==================================================================================================

In [None]:
movies_df.iloc[0:3,[1,2,3]]

#### The above python code is equivalent to SQL's

```sql
SELECT Title, Director, Year
FROM Movies
LIMIT 3
```
##### ==================================================================================================

In [None]:
movies_df.iloc[2:5, [1,2,3]]

#### The above python code is equivalent to SQL's

```sql
SELECT Title, Director, Year
FROM Movies
LIMIT 3
OFFSET 2
```
##### ==================================================================================================

The `value_counts()` method returns the count of each unique value in a given `Series`/column. For example, let's look at the number of entries each Director has in `movies_df`:

In [None]:
movies_df.loc[:,'Director'].value_counts()

#### The above python code is equivalent to SQL's
```sql
SELECT Director, COUNT(*)
FROM Movies
GROUP BY Director
```
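
One difference from a plain `GROUP BY` is that `value_counts()` sorts its result from most to least frequent, which in SQL would need an extra `ORDER BY COUNT(*) DESC`. A small sketch with made-up director names (not taken from the Pixar data) shows this, along with the `normalize=True` option that returns fractions instead of counts:

```python
import pandas as pd

# Hypothetical values for illustration only.
directors = pd.Series(["Ann", "Bob", "Ann", "Cy", "Ann", "Bob"], name="Director")

counts = directors.value_counts()                # sorted from most to least frequent
shares = directors.value_counts(normalize=True)  # fractions of the total instead of counts

print(counts)
print(shares)
```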


##### ==================================================================================================

We can use the `mean()` method to help us find the average of a column or group of columns.

In [None]:
movies_df.loc[:, 'Length_minutes'].mean()

#### The above python code is equivalent to SQL's
```sql
SELECT AVG(Length_minutes)
FROM Movies
```

Using the `groupby()` method, we can perform operations that are similar to the `GROUP BY` clause in SQL.

For example, let's get the average `Length_minutes` by `Director` to see the average number of minutes for each Director's movies:

In [None]:
movies_df.loc[:, ['Director', 'Length_minutes']].groupby('Director').mean()

#### The above python code is equivalent to SQL's
```sql
SELECT Director, AVG(Length_minutes) AS Length_minutes
FROM Movies
GROUP BY Director
```
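
As a sketch of the same idea on a small made-up dataframe (`demo_df` and its values are hypothetical), the column can also be selected after `groupby`, and `.agg` can compute several summaries in one call:

```python
import pandas as pd

demo_df = pd.DataFrame(
    {
        "Director": ["Ann", "Bob", "Ann", "Bob"],
        "Length_minutes": [90, 100, 110, 120],
    }
)

# Selecting the column after groupby returns a Series indexed by Director.
avg_length = demo_df.groupby("Director")["Length_minutes"].mean()

# agg computes several summaries at once, one column per statistic.
summary = demo_df.groupby("Director")["Length_minutes"].agg(["mean", "min", "max"])

print(avg_length)
print(summary)
```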

##### ==================================================================================================
### Filtering Data
Applying a comparison operator to a column produces a True/False value for each row, which we can use to keep only the rows that meet our conditions.

Example: Suppose we want to return only the movies that are longer than 100 minutes.

In [7]:
# Create the filter 
movie_filter = movies_df.loc[:, "Length_minutes"] > 100
# Use the filter in the `.loc` selector
movies_df.loc[movie_filter, :]

# An example showing everything in a single step 
movies_df.loc[movies_df.loc[:, "Length_minutes"] > 100, :]



#### The above python code is equivalent to SQL's
```sql
SELECT *
FROM Movies
WHERE Length_minutes > 100
```
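
To make the mechanics visible, the sketch below prints the intermediate filter on a small made-up dataframe (`demo_df` and its values are hypothetical). The comparison yields one boolean per row, and `.loc` keeps the rows marked `True`.

```python
import pandas as pd

demo_df = pd.DataFrame(
    {"Title": ["A", "B", "C"], "Length_minutes": [81, 116, 100]}
)

mask = demo_df.loc[:, "Length_minutes"] > 100  # Series of True/False, one per row
print(mask.tolist())

long_movies = demo_df.loc[mask, :]  # keeps only the rows where mask is True
print(long_movies)
```

A movie of exactly 100 minutes is excluded here because `>` is a strict comparison; `>=` would include it.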
##### ==================================================================================================

#### Multiple Conditional Filtering

Suppose we want to return movie information only if it is longer than 100 minutes and was created before the year 2005.

In [None]:
movie_len_filter = movies_df.loc[:, "Length_minutes"] > 100
movie_year_filter = movies_df.loc[:, "Year"] < 2005
movies_df.loc[(movie_len_filter) & (movie_year_filter), :]

#### The above python code is equivalent to SQL's
```sql
SELECT *
FROM Movies
WHERE Length_minutes > 100
AND Year < 2005
```
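
Boolean filters are combined with `&` (and), `|` (or), and `~` (not) rather than Python's `and`, `or`, and `not`, which do not work element by element on a `Series`. The sketch below uses a made-up dataframe (`demo_df` is hypothetical) to show all three. When comparisons are written inline, each one needs its own parentheses, because `&` and `|` bind more tightly than `>` and `<`.

```python
import pandas as pd

demo_df = pd.DataFrame(
    {
        "Title": ["A", "B", "C", "D"],
        "Length_minutes": [81, 116, 95, 120],
        "Year": [1995, 2004, 2006, 2010],
    }
)

is_long = demo_df.loc[:, "Length_minutes"] > 100
is_old = demo_df.loc[:, "Year"] < 2005

both = demo_df.loc[is_long & is_old, :]    # AND: long and old
either = demo_df.loc[is_long | is_old, :]  # OR: long, old, or both
not_long = demo_df.loc[~is_long, :]        # NOT: 100 minutes or shorter

# Inline form: parentheses around each comparison are required.
inline = demo_df.loc[(demo_df["Length_minutes"] > 100) & (demo_df["Year"] < 2005), :]
```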
##### ==================================================================================================
### Sorting Data
The `sort_values()` method sorts values in ascending order by default. To sort in descending order, pass `ascending=False`.

`.reset_index(drop=True)` renumbers the index from 0 after sorting and discards the old index.

In [None]:
movies_df.loc[:,"Title"].sort_values().reset_index(drop=True)

#### The above python code is equivalent to SQL's

```sql
SELECT Title
FROM Movies
ORDER BY Title
```
##### ==================================================================================================

Sort the entire dataframe by a single column:

In [None]:
movies_df.sort_values("Title").reset_index(drop=True)

#### The above python code is equivalent to SQL's
```sql
SELECT *
FROM Movies
ORDER BY Title
```
##### ==================================================================================================

We can also sort using multiple columns.

Example: We can sort by Director first (ascending), then within each Director sort the film Titles in descending order.

In [None]:
movies_df.sort_values(["Director","Title"], ascending=[True, False]).reset_index(drop=True)
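
In SQL terms this corresponds to `ORDER BY Director ASC, Title DESC`. A minimal sketch on a made-up dataframe (`demo_df` and its values are hypothetical) shows how the `ascending` list lines up with the column list:

```python
import pandas as pd

demo_df = pd.DataFrame(
    {
        "Director": ["Bob", "Ann", "Bob", "Ann"],
        "Title": ["X", "Y", "Z", "W"],
    }
)

# The first entry of `ascending` applies to Director, the second to Title.
sorted_df = demo_df.sort_values(
    ["Director", "Title"], ascending=[True, False]
).reset_index(drop=True)
print(sorted_df)
```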

##### ==================================================================================================
### Merging DataFrames

In `pandas`, the `pd.concat` function combines dataframes, either by stacking one on top of another or by placing them side by side.

But first let us introduce a new dataset:

In [None]:
other_movies_df = pd.read_csv("Other_Movies.csv")

In [None]:
other_movies_df.head()

##### ==================================================================================================
Now let's combine the two dataframes, `movies_df` and `other_movies_df`, using `pd.concat` and call the new dataframe `all_movies_df`.

In [None]:
all_movies_df = pd.concat([movies_df,other_movies_df]).reset_index(drop=True)

In [None]:
all_movies_df.head(-1) # A negative n shows all rows except the last |n|, so this omits only the final row

##### ==================================================================================================
Now let's introduce another dataframe containing the scores each movie received.

In [None]:
scores_df = pd.read_csv("Movie_Scores.csv")

In [None]:
scores_df.head()

##### ==================================================================================================
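Now we can combine the two dataframes side by side. Note that `pd.concat` pairs rows by index, so this assumes the rows of `scores_df` are in the same order as the rows of `all_movies_df`.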

In [None]:
movies_and_scores_df = pd.concat([all_movies_df,scores_df], axis = "columns").reset_index(drop=True)

In [None]:
movies_and_scores_df.head(-1)
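
The two directions of `pd.concat` can be sketched on small made-up dataframes (`top`, `bottom`, and `scores` are hypothetical stand-ins for the movie files). Stacking keeps each dataframe's original index unless it is reset, and side-by-side concatenation matches rows by index label rather than by any key column.

```python
import pandas as pd

top = pd.DataFrame({"Title": ["A", "B"]})
bottom = pd.DataFrame({"Title": ["C", "D"]})
scores = pd.DataFrame({"Score": [8.1, 7.4, 6.9, 9.0]})

# Stacking keeps each frame's own index (0, 1, 0, 1) unless it is reset.
stacked = pd.concat([top, bottom]).reset_index(drop=True)

# Side by side: rows are matched by index label, not by any key column.
side_by_side = pd.concat([stacked, scores], axis="columns")
print(side_by_side)
```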

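##### ==================================================================================================
The `.merge` method joins two dataframes on a shared key column, similar to a SQL `JOIN`. To demonstrate, let's create two small dataframes, `managers` and `captains`, that share an `Id` column.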



In [None]:
managers = pd.DataFrame(
    {
    'Id': [1,2,3],
    'Manager':['Chris','Maritza','Jamin']
    }
)

In [None]:
managers.head()

In [None]:
captains = pd.DataFrame(
    {
    'Id': [2,2,3,1,1,3,2,3,1,1,3,3],
    'Captain':['Derick','Shane','Becca','Anna','Christine','Melody','Tom','Eric','Naomi','Angelina','Nancy','Richard'],
    'Title':['C','C','SC','C','SC','C','C','SC','C','EC','C','SC']
    }
)

In [None]:
captains.head(12)

In [None]:
roster = captains.merge(managers,left_on = 'Id', right_on = 'Id')
roster.head(-1)

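In [None]:
# For comparison: concat pairs rows by index position rather than by matching `Id`,
# so it is not a join; rows without a partner are filled with NaN
test_roster = pd.concat([captains, managers], axis="columns").reset_index(drop=True)
test_roster.head()

#### The `.merge` call above (not the `concat` comparison) is equivalent to SQL's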
```sql
SELECT *
FROM Captains
INNER JOIN Managers
ON Captains.Id = Managers.Id
```
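
By default `.merge` performs an inner join, so rows whose key has no match in the other dataframe are dropped. As a sketch on made-up data (`left` and `right` are hypothetical, with one unmatched `Id` on each side), the `how` parameter selects a different join type:

```python
import pandas as pd

left = pd.DataFrame({"Id": [1, 2, 4], "Captain": ["Anna", "Derick", "Sam"]})
right = pd.DataFrame({"Id": [1, 2, 3], "Manager": ["Chris", "Maritza", "Jamin"]})

inner = left.merge(right, on="Id")                  # default how="inner"
left_join = left.merge(right, on="Id", how="left")  # keep every row of `left`

print(inner)
print(left_join)
```

In the left join, the captain with `Id` 4 is kept and the `Manager` value for that row is missing (NaN).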
##### ==================================================================================================
### Column Renaming

We can use the `.rename` method to relabel the columns of a dataframe. Suppose we want to rename `Id` to `Cohort` and `Title` to `Captain Rank`.

In [None]:
roster = roster.rename(columns = {"Id":"Cohort","Title":"Captain Rank"})
roster.head(-1)

In [None]:
roster.columns

To replace all the column names at once, assign a list with the same length as the number of columns.

In [None]:
roster.columns = ['Cohort Num','Capt','Capt Rank','Manager']
roster.head(-1)
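
One detail worth noting is that `.rename` returns a new dataframe and leaves the original untouched, which is why the result is assigned back to `roster` above. A minimal sketch with a made-up dataframe (`demo_df` is hypothetical):

```python
import pandas as pd

demo_df = pd.DataFrame({"Id": [1, 2], "Title": ["C", "SC"]})

renamed = demo_df.rename(columns={"Id": "Cohort"})  # returns a new dataframe

print(list(demo_df.columns))  # the original is unchanged
print(list(renamed.columns))
```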

##### ==================================================================================================
### Drop Columns

In [None]:
#df.drop(["column1","column2"], axis = "columns")

roster = roster.drop("Cohort Num", axis = "columns")
roster.head(-1)

##### ==================================================================================================
### Missing Values / NaN Values

There are various types of missing data. Most commonly, the data was never collected, was handled incorrectly, or was entered as a null value.

Missing data can be remedied by the following:
1. Removing the row with the missing/NaN values
2. Removing the column with the missing/NaN values
3. Filling in the missing data

For simplicity, we will only focus on the first two methods. The third method can be resolved with value interpolation by use of information from other rows or columns of the dataset. This process requires knowledge outside of the scope of this lesson. There are entire studies dedicated to this topic alone.

In [None]:
cars = pd.read_csv("Cars.csv")
cars.head(-1)

##### ==================================================================================================
Now let's sort the companies in alphabetical order.

In [None]:
cars = cars.sort_values("Company").reset_index(drop=True)
cars.head(-1)

##### ==================================================================================================
Now let's check how many entries are missing. As we can see, there are 4 entries missing in the Location column and 5 entries missing in the Year column.

In [None]:
cars.isna().sum()
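
The counting works because `.isna()` returns a dataframe of True/False values and `.sum()` treats each `True` as 1. The sketch below shows this on a made-up dataframe (`demo_df` and its values are hypothetical, not the `Cars.csv` data), and also counts how many rows contain at least one missing value.

```python
import pandas as pd

demo_df = pd.DataFrame(
    {
        "Company": ["Acme", "Bolt", "Cruz"],
        "Location": ["NY", None, "LA"],
        "Year": [2001, None, None],
    }
)

missing_per_column = demo_df.isna().sum()  # True counts as 1 when summed

# any(axis="columns") flags each row that has at least one missing value.
rows_with_missing = demo_df.isna().any(axis="columns").sum()

print(missing_per_column)
print(rows_with_missing)
```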

##### ==================================================================================================
Let's inspect all the rows with missing Location entries.

In [None]:
missing_car_info_filter = cars.loc[:, "Location"].isna()
cars.loc[missing_car_info_filter, :]

##### ==================================================================================================
Let's inspect all the rows with missing Year entries.

In [None]:
cars.loc[cars.loc[:, "Year"].isna(), :]

##### ==================================================================================================
For simplicity, we can fill all the missing Location entries with the string "NA".

In [None]:
cars.loc[:, "Location"] = cars.loc[:, "Location"].fillna(value="NA")

In [None]:
cars.head(-1)

##### ==================================================================================================
Now let's drop any rows that still have missing entries.

In [None]:
cars = cars.dropna().reset_index(drop=True)
cars.head(-1)

In [None]:
cars.info()
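
The second remedy listed earlier, removing columns with missing values, uses the same method with `axis="columns"`. The sketch below shows it on a made-up dataframe (`demo_df` is hypothetical, not the `Cars.csv` data), together with the `subset` parameter, which limits the check to chosen columns.

```python
import pandas as pd

demo_df = pd.DataFrame(
    {
        "Company": ["Acme", "Bolt", "Cruz"],
        "Location": ["NY", None, "LA"],
        "Year": [2001, 2005, None],
    }
)

drop_rows = demo_df.dropna()                       # drop rows with any missing value
drop_cols = demo_df.dropna(axis="columns")         # drop columns with any missing value
drop_if_no_year = demo_df.dropna(subset=["Year"])  # only check the Year column

print(drop_rows)
print(drop_cols)
print(drop_if_no_year)
```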

##### ==================================================================================================
## Summary

- `pandas` provides `Series` and `DataFrame` classes that work with tabular style data.
- `.loc` selects rows and columns based on their labels.
- `.iloc` selects rows and columns based on their integer positions.
- Calling a DataFrame method with `axis="rows"` or `axis=0` causes it to operate along the row axis.
- Calling a DataFrame method with `axis="columns"` or `axis=1` causes it to operate along the columns axis.
- `sort_values` reorders rows based on the values in one or more columns
- `.rename()` can rename columns in DataFrames. You can also rewrite the `.columns` attribute to rename columns.
- `.isna()` detects missing values
- `.fillna()` replaces missing (NaN) values with a specified value
- `.dropna()` removes, by default, all rows that contain missing (NaN) values
- `.merge()` combines two DataFrames on a shared key column, like a SQL join

##### ==================================================================================================
### Exercise 1:
Create a new DataFrame called `cohort` by inner joining the two DataFrames `roster` and `exam`

In [None]:
#solution
roster = pd.DataFrame(
{
    "Name" : ["James","Greg","Patrick","Chris","Cynthia","Chandra", "John","David","Tiffany","Peter"],
    "Id": ["1","2","3","4","5","6","7","8","9","10"],
    
})

exam = pd.DataFrame({
    "Exam 1" : [89,78,81,90,93,76,66,87,42,55],
    "Exam 2" : [100,74,20,86,60,76,92,97,88,90],
    "Exam 3" : [85,60,90,90,88,76,55,None,64,79],
    "Id" : ["4","2","1","7","5","10","6","3","9","8"]
})


cohort = pd.merge(roster, exam, on = "Id")

##### ==================================================================================================
### Exercise 2:
Fill all missing grades with 0.

In [None]:
cohort.isna()

In [None]:
cohort.describe()

In [None]:

cohort = cohort.fillna(value = 0)  # fillna returns a new DataFrame, so assign the result back

##### ==================================================================================================
### Exercise 3:
Update James's Exam 2 score from 20 to 85 and update Tiffany's Exam 1 score from 42 to 88

In [None]:
# YOUR CODE HERE

##### ==================================================================================================
### Exercise 4:

Create a series called `Average` that takes the average of Exam 1, Exam 2 and Exam 3 scores

In [None]:
# YOUR CODE HERE

##### ==================================================================================================
### Exercise 5:
Incorporate the newly created `Average` column into the DataFrame `cohort`

In [None]:
# YOUR CODE HERE

##### ==================================================================================================
### Exercise 6:
Sort the dataset by Average in **descending** order and reindex the DataFrame

In [None]:
# YOUR CODE HERE

##### ==================================================================================================
### Exercise 7:
Drop columns Exam 1, 2, and 3

In [None]:
# YOUR CODE HERE

##### ==================================================================================================
### Exercise 8:
Select only the **Name, Id and Average** columns for the top 3 students based on highest Average grade

In [None]:
# YOUR CODE HERE