<img src=images/gdd-logo.png align=right width=300px>

# Self Study - Pandas 1

Welcome to the Pandas independent study part 1! Here you will review some of the key methods used in Pandas to perform simple data exploration as well as selecting and filtering data.

To practice these skills, we will use data about the movies and TV shows available on US Netflix.

- [About the data](#about)
- [Data Exploration](#de)
    - [<mark>Exercises</mark>](#de-ex)
    - [<mark>Answers</mark>](#de-an)
- [Selections](#se)
    - [<mark>Exercises</mark>](#se-ex)
    - [<mark>Answers</mark>](#se-an)
- [Filtering](#fi)
    - [<mark>Exercises</mark>](#fi-ex)
    - [<mark>Answers</mark>](#fi-an)

<img src=images/netflix.png align=right width=200px>

<a id='about'></a>

## About the data

The data is shared through the CC0: Public Domain via [Kaggle](https://www.kaggle.com/shivamb/netflix-shows/version/5). 

This dataset lists the movies and TV shows available on US Netflix, along with details such as cast, directors, ratings, release year and duration.

In [None]:
import pandas as pd

In [None]:
netflix = pd.read_csv('data/netflix.csv')
netflix.head()

--- 
<a id='de'></a>

## Data Exploration

Data exploration is an important step in any data analysis. Use it to get to know your data and to answer any questions you have about it.

<a id='de-ex'></a>

### <mark>Exercises</mark>

Use the following attributes and methods to answer the questions below. Then come up with one (simple) question of your own and answer it using pandas methods.

- `df.shape`
- `df.info()`
- `df.describe()`
- `df['col'].unique()`
- `df['col'].value_counts()`
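As a rough illustration of what each item in the list above returns, here is a minimal sketch on a small made-up DataFrame. The `toy` data and its column names are assumptions for illustration only; they are not the Netflix dataset.

```python
import pandas as pd

# Made-up toy data for illustration only (NOT the Netflix dataset)
toy = pd.DataFrame(
    {
        "title": ["A", "B", "C", "D"],
        "kind": ["Movie", "TV Show", "Movie", "Movie"],
        "year": [2001, 2015, 2015, 2020],
        "director": ["Ann", None, "Bob", None],
    }
)

# .shape is an attribute (no parentheses): a (rows, columns) tuple
n_rows, n_cols = toy.shape

# .info() prints the dtype and non-null count of every column
toy.info()

# .describe() returns summary statistics for the numeric columns
summary = toy.describe()

# .unique() returns the distinct values in a column
kinds = toy["kind"].unique()

# .value_counts() counts each value, most frequent first
kind_counts = toy["kind"].value_counts()

print(n_rows, n_cols)
print(summary)
print(kinds)
print(kind_counts)
```

Note that `.info()` is often the quickest way to spot missing values, because the non-null count of a column with gaps is lower than the number of rows.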

Code cells and raw cells have been provided for your code and for your answer to each question.

1. How many rows and columns are there in the dataframe?

In [None]:
# code goes here


2. There are 5 columns that have missing values. Which columns are they?

In [None]:
# code goes here


3. In what year was the oldest title on Netflix released?

In [None]:
# code goes here


4. How many different types of ratings are there?

In [None]:
# code goes here


5. Which country has the second-highest number of shows/movies?

In [None]:
# code goes here


<a id='de-an'></a>

### <mark>Answers</mark>

In [None]:
%load answers/data-exploration-1.py

In [None]:
%load answers/data-exploration-2.py

In [None]:
%load answers/data-exploration-3.py

In [None]:
%load answers/data-exploration-4.py

In [None]:
%load answers/data-exploration-5.py

---
<a id='se'></a>
## Selections

When working with data you often need to find a subset of the data. 

You can easily select the top/bottom of the data using `df.head()` and `df.tail()`.

In [None]:
netflix.head()

In [None]:
netflix.tail()

Before doing this, it often makes sense to sort the data so that the top or bottom rows form a meaningful subset.

In [None]:
(
    netflix
    .sort_values('director')
    .head()
)
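Sorting can also run in descending order, and `head()` accepts the number of rows to return. The sketch below shows both on a made-up `toy` DataFrame (hypothetical data, not the Netflix dataset).

```python
import pandas as pd

# Made-up toy data for illustration only
toy = pd.DataFrame(
    {
        "title": ["b", "a", "d", "c"],
        "year": [2015, 2001, 2020, 1999],
    }
)

# Ascending sort (the default), then keep the first 2 rows: the two oldest
oldest_two = toy.sort_values("year").head(2)

# Descending sort, then keep the first 2 rows: the two newest
newest_two = toy.sort_values("year", ascending=False).head(2)

print(oldest_two)
print(newest_two)
```

The rows keep their original index labels after sorting, so the index of a sorted DataFrame is usually no longer in order.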

You can also use the `.loc[]` indexer to select a slice of rows by their index labels. Unlike a regular Python slice, a `.loc[]` slice includes both endpoints.

In [None]:
netflix.loc[300:305]
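Because `.loc[]` works with index labels, it can behave differently from positional slicing, especially after sorting. The following sketch contrasts it with `.iloc[]`, which selects by position. The `toy` DataFrame is made up for illustration.

```python
import pandas as pd

# Made-up toy data with the default index labels 0, 1, 2, 3
toy = pd.DataFrame({"title": ["b", "a", "d", "c"]})

# .loc slices by label and includes BOTH endpoints: labels 1 and 2
by_label = toy.loc[1:2]

# .iloc slices by position and excludes the end: position 1 only
by_position = toy.iloc[1:2]

# After sorting, the labels travel with their rows (order is now 1, 0, 3, 2),
# so .iloc is the straightforward way to take "the last two rows"
sorted_toy = toy.sort_values("title")
last_two = sorted_toy.iloc[-2:]

print(by_label)
print(by_position)
print(last_two)
```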

<a id='se-ex'></a>

### <mark>Exercises</mark>

1. Select the first 5 films after sorting by `title` alphabetically. All these films begin with the same character. What is it?

In [None]:
# code goes here


Answer:

2. Change the number in the following code (`8500`) until you only have rows containing films that begin with `z`. They should be the last rows of the data. What is the lowest number you can use?

In [None]:
# code goes here


3. Find the 5 oldest films that are available on Netflix. There are 2 directors responsible for these films (and one unknown). Who are these directors?

In [None]:
# code goes here


<a id='se-an'></a>

### <mark>Answers</mark>

In [None]:
%load answers/selections-1.py

In [None]:
%load answers/selections-2.py

In [None]:
%load answers/selections-3.py

---

<a id='fi'></a>
## Filtering

Now we are going to use a lambda function to do some filtering.

Remember that to filter we use the following syntax:

```python
dataframe.loc[lambda df: df['col'] == 'value']
```
Where:
- `dataframe` is the name of the data (in this case `netflix`)
- `'col'` is the column we want to filter on
- `==` is the comparison; also available are `!=`, `<`, `>`, `<=`, `>=`
- `'value'` is the value to compare against; this could be a string or a number

Below is an example using the netflix data:

In [None]:
netflix.loc[lambda df: df['show_id'] == 's6240']
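The same pattern extends to other comparisons and to several conditions at once. The sketch below uses a made-up `toy` DataFrame (hypothetical data, not the Netflix dataset) to show a numeric comparison, two conditions combined with `&`, and the equivalent chained form.

```python
import pandas as pd

# Made-up toy data for illustration only
toy = pd.DataFrame(
    {
        "title": ["A", "B", "C", "D"],
        "kind": ["Movie", "TV Show", "Movie", "Movie"],
        "year": [2001, 2015, 2015, 2020],
    }
)

# A numeric comparison: rows released after 2010
recent = toy.loc[lambda df: df["year"] > 2010]

# Two conditions combined with & (and); each condition needs its own parentheses
recent_movies = toy.loc[
    lambda df: (df["kind"] == "Movie") & (df["year"] > 2010)
]

# The same result, written as two chained filters
chained = (
    toy
    .loc[lambda df: df["kind"] == "Movie"]
    .loc[lambda df: df["year"] > 2010]
)

print(recent)
print(recent_movies)
print(chained)
```

Use `|` instead of `&` when a row should match either condition. The parentheses matter because `&` and `|` bind more tightly than the comparison operators.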

<a id='fi-ex'></a>

### <mark>Exercises</mark>

1. Filter the data to only show TV Shows. How many TV Shows are there?

In [None]:
# code goes here


2. In the exercises above you found the earliest release year. Use this year to filter the data and find out which show on Netflix was released then. You should find that the show was added on `December 30, 2018`. Is that what you found?

In [None]:
# code goes here


3. In 2021 the most watched TV Show was a show that has `12 Seasons` on Netflix, and was added to Netflix on `June 30, 2017`. What was the show?

In [None]:
# code goes here


4. Your friend watched a film the other day and wants to recommend it, but can't remember the name. Here is what he remembers. Can you find the film?

    1. The movie was **listed in** `Comedies, Independent Movies`
    2. The movie was definitely **released** after `2010`, then added to Netflix later
    3. The **rating** for the movie was `R`
    4. The **title** of the movie only had **one word***
    5. The **director was also the leading actor** in the movie*

In [None]:
# code goes here


**NOTE**: The starred parts 4* and 5* (parts D and E in the hint file names) are optional to perform in pandas, as you can use visual inspection after you have filtered for the first three parts. However, you can do them in pandas if you'd like a challenge! Run the cells below if you would like hints on how to do this.

In [None]:
%load answers/hint-filtering-partD.py

In [None]:
%load answers/hint-filtering-partE.py

<a id='fi-an'></a>
### <mark>Answers</mark>

In [None]:
%load answers/filtering-1.py

In [None]:
%load answers/filtering-2.py

In [None]:
%load answers/filtering-3.py

In [None]:
%load answers/filtering-4.py