# Lab 4B

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Task 1: Setting the Aesthetics for the plots (2 marks)


### 1.1: Set the Seaborn figure theme and scale up the text in the figures (2 marks)

There are five preset Seaborn themes: `darkgrid`, `whitegrid`, `dark`, `white`, and `ticks`. 
They are each suited to different applications and personal preferences.
You can see what they look like [here](https://seaborn.pydata.org/tutorial/aesthetics.html#seaborn-figure-styles).

Hint: You will need to use the `font_scale` property of the `set_theme()` function in Seaborn.

In [3]:
# Your Solution Here

# set_theme() resets the style, so pass the style and the font scale in one call
sns.set_theme(style="darkgrid", font_scale=2)

## Task 2: Exploratory Data Analysis (40 marks)

### 2.1. Describe your dataset (2 marks)

Consider the following questions to guide you in your exploration:

- Who: Which company/agency/organization provided this data?
- What: What is in your data?
- When: When was your data collected (for example, for which years)?
- Why: What is the purpose of your dataset? Is it for transparency/accountability, public interest, fun, learning, etc...
- How: How was your data collected? Was it a human collecting the data? Historical records digitized? Server logs?

**Hint: The [pokemon dataset is from this Kaggle page.](https://www.kaggle.com/rounakbanik/pokemon)**

*Hint: You probably will not need more than 250 words to describe your dataset. Not all of the questions above need to be answered; they are there to guide your exploration and get you thinking about the context of your data. It is also possible you will not know the answers to some of the questions above, and that is FINE: data scientists are often faced with the challenge of analyzing data from unknown sources. Do your best, and acknowledge the limitations of your data as well as of your understanding of it. Also, make it clear what you're speculating about. For example, "I speculate that the {...column_name...} column must be related to {....} because {....}."*

### 2.2. Load the dataset from a file, or URL (1 mark)

This needs to be a pandas dataframe. Remember that others may be running your Jupyter notebook, so it's important that the data is accessible to them. If your dataset isn't accessible as a URL, make sure to commit it into your repo. If your dataset is too large to commit (>100 MB), and it's not possible to get a URL to it, you should contact your instructor for advice.

You can use this URL to load the data: https://github.com/firasm/bits/raw/master/pokemon.csv

In [4]:
# Your solution here

df = pd.read_csv("data/autoscout24-germany-dataset.csv")
df.head()

| | mileage | make | model | fuel | gear | offerType | price | hp | year |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 235000 | BMW | 316 | Diesel | Manual | Used | 6800 | 116.0 | 2011 |
| 1 | 92800 | Volkswagen | Golf | Gasoline | Manual | Used | 6877 | 122.0 | 2011 |
| 2 | 149300 | SEAT | Exeo | Gasoline | Manual | Used | 6900 | 160.0 | 2011 |
| 3 | 96200 | Renault | Megane | Gasoline | Manual | Used | 6950 | 110.0 | 2011 |
| 4 | 156000 | Peugeot | 308 | Gasoline | Manual | Used | 6950 | 156.0 | 2011 |


### 2.3. Explore your dataset (3 marks)

Which of your columns are interesting/relevant? Remember to take some notes on your observations; you'll need them for the next EDA step (initial thoughts).

#### Initial Thoughts

Initially, I am thinking that the price, mileage, and hp columns will be the most interesting. Although all columns will affect the price, I am most interested in generalizing beyond a specific make or model, though that also sounds interesting. To be honest, I am currently not sure which columns will end up being the most interesting; I see strong potential in each of them.

#### 2.3.1:  You should start with [`df.describe().T`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html) (2 marks)

See the [linked documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html) for the use of `include`/`exclude` to look at numerical and categorical data.

In [5]:
# Your solution to output `df.describe().T` for numerical columns:
df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| mileage | 46405.0 | 71177.864109 | 62625.308456 | 0.0 | 19800.0 | 60000.0 | 105000.0 | 1111111.0 |
| price | 46405.0 | 16572.337227 | 19304.695924 | 1100.0 | 7490.0 | 10999.0 | 19490.0 | 1199900.0 |
| hp | 46376.0 | 132.990987 | 75.449284 | 1.0 | 86.0 | 116.0 | 150.0 | 850.0 |
| year | 46405.0 | 2016.012951 | 3.155214 | 2011.0 | 2013.0 | 2016.0 | 2019.0 | 2021.0 |


In [6]:
# Your solution to output `df.describe().T` for categorical columns:
# Transposed summary: one row per categorical column
display(df.describe(exclude=[np.number]).T)

# Same summary, untransposed: one column per categorical column
display(df.describe(exclude=[np.number]))

| | count | unique | top | freq |
|---|---|---|---|---|
| make | 46405 | 77 | Volkswagen | 6931 |
| model | 46262 | 841 | Golf | 1492 |
| fuel | 46405 | 11 | Gasoline | 28864 |
| gear | 46223 | 3 | Manual | 30380 |
| offerType | 46405 | 5 | Used | 40122 |


| | make | model | fuel | gear | offerType |
|---|---|---|---|---|---|
| count | 46405 | 46262 | 46405 | 46223 | 46405 |
| unique | 77 | 841 | 11 | 3 | 5 |
| top | Volkswagen | Golf | Gasoline | Manual | Used |
| freq | 6931 | 1492 | 28864 | 30380 | 40122 |


#### 2.3.2 Let's try `pandas_profiling` now. (1 mark)

**Hint: To install the [`pandas_profiling`](https://towardsdatascience.com/exploratory-data-analysis-with-pandas-profiling-de3aae2ddff3) package, you'll need to use `conda`:**

- `conda install -c conda-forge pandas-profiling`

In [9]:
import pandas_profiling

# Your solution for `pandas_profiling`

profile = pandas_profiling.ProfileReport(df)
profile.to_file(output_file='output.html')


### 2.4. Initial Thoughts (2 marks)

#### 2.4.1. Use this section to record your observations. (2 marks)

Does anything jump out at you as surprising or particularly interesting? 

Where do you think you'll go with exploring this dataset? Feel free to take notes in this section and use it as a scratch pad.

Any content in this area will only be marked for effort and completeness.

#### Your observations here:

- Looking at the correlation plots generated, I am surprised that hp and year have the largest correlation with price.
- It appears that there is little correlation between the mileage and price of a vehicle; this is counterintuitive and needs to be explored further.
- The dendrogram chart is very interesting. It seems that gear plays a large part in the price: price influences model (upscale brands offer automatic), and model influences hp (upscale brands often have larger engines), which, as discussed, may correlate with a higher price.
- Side question: more Volkswagen cars are sold than any other make; does a car brand's origin influence how many are sold? Ford ranked 3rd, so it is not obvious.

### 2.5. Wrangling (5 marks)

The next step is to wrangle your data based on your initial explorations. Normally, by this point, you have some idea of what your research question will be, and that will help you narrow and focus your dataset. 

In this lab, we will guide you through some wrangling tasks with this dataset.

#### 2.5.1. Drop the 'Generation', 'Sp. Atk', 'Sp. Def', 'Total', and the '#' columns (1 mark)

In [6]:
# Your solution here

#### 2.5.2. Drop any NaN values in HP, Attack, Defense, Speed (1 mark)

In [7]:
# Your solution here

#### 2.5.3. Reset the index to get a new index without missing values (1 mark)

In [8]:
# Your solution here

#### 2.5.4. A new column was added called `index`; remove it. (1 mark)
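As a rough illustration of why the extra column shows up: `reset_index()` keeps the old index as a column named `index` unless `drop=True` is passed. The sketch below uses a made-up toy dataframe (not the lab dataset), and the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# Made-up toy data for illustration only (not the lab dataset).
toy = pd.DataFrame({"HP": [45.0, np.nan, 80.0], "Attack": [49.0, 62.0, 82.0]})

# Dropping the NaN row leaves a gap in the index: [0, 2].
cleaned = toy.dropna(subset=["HP"])

# Default behaviour: the old index is kept as a new `index` column.
with_col = cleaned.reset_index()

# With drop=True the old index is discarded instead.
without_col = cleaned.reset_index(drop=True)

print(list(with_col.columns))     # ['index', 'HP', 'Attack']
print(list(without_col.columns))  # ['HP', 'Attack']
```

Either route ends in the same place: drop the `index` column afterwards, or avoid creating it with `drop=True`.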

In [9]:
# Your solution here

#### 2.5.5. Calculate a new column called "Weighted Score" that computes an aggregate score comprising:

- 20% 'HP'
- 40% 'Attack'
- 30% 'Defense'
- 10% 'Speed'

**(1 mark)**
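One possible way to compute such a weighted sum is sketched below on a made-up two-row dataframe. The values and the `weights` dictionary are assumptions for illustration; only the column names and percentages come from the task description.

```python
import pandas as pd

# Made-up toy rows for illustration only (not the lab dataset).
toy = pd.DataFrame(
    {"HP": [50, 100], "Attack": [60, 80], "Defense": [40, 90], "Speed": [70, 30]}
)

# Weights taken from the task description; they add up to 1.
weights = {"HP": 0.2, "Attack": 0.4, "Defense": 0.3, "Speed": 0.1}

# Multiply each column by its weight and add the results row by row.
toy["Weighted Score"] = sum(toy[col] * w for col, w in weights.items())

# Rows come out at roughly 53 and 82 (up to floating-point rounding).
print(toy)
```

Keeping the weights in a dictionary makes it easy to check that they sum to 1 and to change them later.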

In [10]:
# Your solution here

### 2.6. Research questions (2 marks)

#### 2.6.1 Come up with at least two research questions about your dataset that will require data visualizations to help answer. (2 marks)

Recall that for this purpose, you should only aim for "Descriptive" or "Exploratory" research questions.

**Hint1: You are welcome to calculate any columns that you think might be useful to answer the question (or re-add dropped columns like 'Generation', 'Sp. Atk', 'Sp. Def').**

**Hint2: Try not to overthink this; this is a toy dataset about Pokémon, you're not going to solve climate change or cure world hunger. Focus your research questions on the various Pokémon attributes, and the types.**

#### Your solution here:

**1. Sample Research Question:** Which Pokemon Types are the best, as determined by the Weighted Score?

**2. Your RQ 1:**

**3. Your RQ 2:**



### 2.7. Data Analysis and Visualizations

#### 2.7.1. **Sample Research Question:** Which Pokemon Types are the best, as determined by the Weighted Score? (3 marks)

To answer this question, we will first need to wrangle the data to return the mean Weighted_Score, split by the Pokemon Type 1.

Here is the goal of this analysis:

<img src="groupby.png" width="200px">
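A minimal sketch of this kind of split-and-average step, assuming columns named `Type 1` and `Weighted_Score` and using made-up values, might look like the following. The exact layout of the target image is not reproduced here.

```python
import pandas as pd

# Made-up toy data for illustration only (not the lab dataset).
toy = pd.DataFrame(
    {
        "Type 1": ["Fire", "Water", "Fire", "Water"],
        "Weighted_Score": [60.0, 50.0, 80.0, 70.0],
    }
)

# Group rows by type, average the score within each group,
# then sort so the highest-scoring type comes first.
mean_by_type = (
    toy.groupby("Type 1")["Weighted_Score"]
    .mean()
    .sort_values(ascending=False)
    .to_frame()
)

print(mean_by_type)
```

Calling `.to_frame()` at the end turns the resulting Series back into a dataframe, which renders as a table in a notebook.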


In [11]:
# Your Solution here

#### 2.7.2. Create a violin plot to show the distribution of Weighted_Scores split by all the Pokémon types. (2 marks)

Here is the goal:

<img src="violin.png" width="350px">

In [12]:
# Your Solution here

#### 2.7.3. Create a Box Plot and overlay a strip plot (2 marks)

Here is the goal:

<img src="BoxPlot.png" width="350px">

In [13]:
# Your Solution here 

#### 2.7.4. Create a [Hexbin plot with marginal distributions](http://seaborn.pydata.org/generated/seaborn.jointplot.html) (2 marks)

This plot helps you visualize large amounts of data (and its distributions) by using colours to represent the number of points in a hexagonal shape.

Here is the goal:

<img src="jointplot.png" width="350px">

In [14]:
# Your Solution here 

#### 2.7.5. Create a [PairPlot](https://seaborn.pydata.org/generated/seaborn.pairplot.html#seaborn.pairplot) of the quantitative features of the pokémon dataset (1 mark)

Here is the goal:

<img src="pairplot.png" width="350px">

In [15]:
# Your Solution here

#### 2.7.6. Create a visualization that helps you answer your first research question (2 marks)

In [17]:
# Your Solution here

#### 2.7.7. Create a visualization that helps you answer your second research question (2 marks)

In [18]:
# Your Solution here

### 2.8. Summary and conclusions (3 marks)

#### 2.8.1. Summarize your findings and describe any conclusions and insight you were able to draw from your visualizations. (3 marks)

- **Sample Research Question:** Which Pokemon Types are the best, as determined by the Weighted Score? (3 marks)

    - Summary of findings, insight, and conclusions
    - ..
    

- **Research Question 1:** RQ here

    - Summary of findings, insight, and conclusions
    - ..
    

- **Research Question 2:** RQ here

    - Summary of findings, insight, and conclusions
    - ..

## Task 3. Method Chaining (8 marks)

Method chaining allows you to apply multiple processing steps to your dataframe in fewer lines of code, so it is more readable. You should avoid having too many methods in your chain: the more you have in a single chain, the harder it is to debug or troubleshoot. I would target about 5 methods in a chain, though this is a flexible suggestion. Do what makes your analysis the most readable, and group your chains based on their purpose (e.g., loading/cleaning, processing, etc.).

**Note: See Milestone 2 for a more thorough description of method chaining.**
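To make the idea concrete, here is a small sketch of a chain on a made-up dataframe. The columns, values, and the `Score` formula are assumptions for illustration, not the lab's required solution.

```python
import numpy as np
import pandas as pd

# Made-up toy data for illustration only (not the lab dataset).
toy = pd.DataFrame(
    {
        "Name": ["A", "B", "C"],
        "HP": [45.0, np.nan, 80.0],
        "Attack": [49.0, 62.0, 82.0],
        "Total": [94.0, 62.0, 162.0],
    }
)

# Each step returns a new dataframe, so the calls can be stacked.
# Wrapping the chain in parentheses allows one method per line.
cleaned = (
    toy
    .drop(columns=["Total"])                 # remove an unwanted column
    .dropna(subset=["HP", "Attack"])         # drop rows with missing values
    .reset_index(drop=True)                  # renumber the rows from 0
    .assign(Score=lambda d: 0.5 * d["HP"] + 0.5 * d["Attack"])  # derived column
)

print(cleaned)
```

If a chain misbehaves, commenting out the steps from the bottom up is a simple way to find the one that causes the problem.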

#### 3.1. Use Method Chaining on the commands from sections 2.5.1, 2.5.2, 2.5.3, 2.5.4, 2.5.5. (4 marks)

In [17]:
# Your Solution here

#### 3.2. Use Method Chaining to do the tasks below. (4 marks)

1. Remove all Pokémon 6th generation and above.
2. Remove the Legendary column.
3. Remove all rows that contain "Forme", a special form of Pokémon.
4. Remove all rows that contain "Mega", another weird special form of Pokémon.

**Hint: You will need to use the [.loc](https://towardsdatascience.com/effective-data-filtering-in-pandas-using-loc-40eb815455b6) in combination with the anonymous function lambda.**
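The pattern in the hint can be sketched as follows: inside a chain, `.loc` accepts a function that receives the intermediate dataframe and returns a boolean mask. The toy rows and the `Name` column are assumptions for illustration; `Generation` and `Legendary` follow the task wording.

```python
import pandas as pd

# Made-up toy data for illustration only (not the lab dataset).
toy = pd.DataFrame(
    {
        "Name": ["Charizard", "Mega Charizard X", "Deoxys Attack Forme", "Froakie"],
        "Generation": [1, 1, 3, 6],
        "Legendary": [False, False, True, False],
    }
)

# The lambda is evaluated on the dataframe as it exists at that point
# in the chain, so each filter sees the result of the previous step.
filtered = (
    toy
    .loc[lambda d: d["Generation"] < 6]                # keep generations 1 to 5
    .drop(columns=["Legendary"])                       # remove the column
    .loc[lambda d: ~d["Name"].str.contains("Forme")]   # drop "Forme" rows
    .loc[lambda d: ~d["Name"].str.contains("Mega")]    # drop "Mega" rows
    .reset_index(drop=True)
)

print(filtered)
```

The `~` operator negates the boolean mask, so each of the last two filters keeps the rows that do not contain the given text.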

In [18]:
# Your Solution here

## Task 4. (OPTIONAL) Advanced Visualizations 

### 4.1. Create a ["Ridgeline"](https://seaborn.pydata.org/examples/kde_ridgeplot.html) plot from the plots in 2.7.2 and 2.7.3 (2 marks).

Here is the goal:

<img src="ridgeline.png" width="300px">

In [19]:
# Your Solution Here