In [None]:
import pandas as pd
import matplotlib.pyplot as plt

[`pandas` Cheat Sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)

# Loading the data

In [None]:
df = pd.read_csv('https://drive.switch.ch/index.php/s/UEpTFv2Bfa5C1dd/download')
df.head()

We repeat our simple data cleaning here by dropping every row that contains a `NaN` value.

In [None]:
df = df.dropna()

# Task 0 - Repeat

Calculate the average number of yellow and red cards per game for each country.

*Hint*: If you don't remember how to start, check the **Group Data** section in the cheat sheet.
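
As a reminder of the general group-then-aggregate pattern, here is a minimal sketch on made-up data. The DataFrame `toy` and its columns `team` and `goals` are hypothetical and are not part of the course dataset.

```python
import pandas as pd

# Toy data, not the course dataset: the column names are made up.
toy = pd.DataFrame({
    "team": ["A", "A", "B", "B"],
    "goals": [1, 3, 0, 2],
})

# Group the rows by team, then average the numeric column within each group.
mean_goals = toy.groupby("team")["goals"].mean()
print(mean_goals)
```

The same pattern should carry over to the task once you have decided which column to group by and which columns to average.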

# Task 1.1 - Scatter Plot
Create a Scatter Plot of player weight vs. height.

*Hint*: Check the [`pandas` cheat sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf) for plotting with `pandas`.

Based on the plot alone, do you think there is a correlation between the two? How can you tell from the plot?

**Bonus**: Do the same with `matplotlib`. You can find the cheat sheet for it [here](https://matplotlib.org/cheatsheets/_images/cheatsheets-1.png).

# Task 1.2 - Data Manipulation
Create a new column, called Name Length, that contains the length of the player's name.

*Hint*: Split this into two steps:
* creating a new column (check the `pandas` cheat sheet or last week's notebook if you don't remember how to do this)
* calculating the length of the player's name **for each row**
    
*Hint 2*: If you have trouble with calculating the length of the player's name, have a look at the *Summarize Data* section in the cheat sheet.
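
One way to compute a value for each row is `Series.apply`, sketched below on made-up data. The DataFrame `toy` and the column names `city` and `City Length` are hypothetical; this is an illustration of the pattern, not the intended solution.

```python
import pandas as pd

# Toy data, not the course dataset.
toy = pd.DataFrame({"city": ["Bern", "Zurich", "Basel"]})

# apply calls the given function (here the built-in len) once per entry
# and returns a new Series, which we store as a new column.
toy["City Length"] = toy["city"].apply(len)
print(toy)
```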

# Task 1.3 - Scatter Plot
Create a Scatter Plot of player weight vs. name length. Do you see a correlation between the two? Why or why not? What makes this plot different from the weight vs. height plot?

*Hint*: You can do this exactly the same way as you did in Task 1.1. Solve that task first and then this one is essentially free.

# Task 1.4 - Linear Regression
Create a linear regression model that predicts the player's height based on the player's weight. What height does the model predict for a player who weighs 80 kg? What about a player who weighs 100 kg?

*Hint 1*: For the linear regression, use [`scikit-learn` and its `LinearRegression` model](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html).

*Hint 2*: You have to reshape your data. You can do this with the `numpy.reshape` function. Have a look at the [`reshape` function](https://numpy.org/doc/stable/reference/generated/numpy.reshape.html) for how to continue from there.

The task of linear regression is to find the solution to the mapping:

$ f(X) = AX + b $

* `X` is the training data: usually a matrix whose rows are the examples and whose columns are the features
* `A` is the linear mapping matrix (also called the _coefficient_)
* `b` is the offset (also called _intercept_)
* `f` is the function to map the training data to the target `y`


Note that `X` must be a matrix (a 2-D array), whereas the `weight` column is one-dimensional, so you have to reshape it into a matrix with a single column.

Try fitting the model without reshaping first and make sure you understand the resulting error message.
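
A minimal sketch of the reshaping step on made-up numbers (not the course data). The names `x`, `X_toy`, `y_toy` and `toy_model` are illustrative only; the toy target is constructed as exactly `2 * x + 1`, so the fitted coefficient and intercept should come out close to 2 and 1.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: the target is an exact linear function of x.
x = np.array([1.0, 2.0, 3.0, 4.0])
y_toy = 2 * x + 1

# x has shape (4,), but scikit-learn expects (n_samples, n_features).
# reshape(-1, 1) keeps the number of rows and creates a single column.
X_toy = x.reshape(-1, 1)

toy_model = LinearRegression()
toy_model.fit(X_toy, y_toy)
print(toy_model.coef_, toy_model.intercept_)

# The input to predict has to be 2-D as well: one row per query.
prediction = toy_model.predict(np.array([[5.0]]))
print(prediction)
```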

In [None]:
from sklearn.linear_model import LinearRegression

X = ...
y = ...

model = LinearRegression()
model.fit(X, y)

In [None]:
model.predict(...)

# Task 1.5 - Scatter Plot with Regression Line
Create a Scatter Plot of player weight vs. height. Draw the regression line into the scatter plot.

*Hint*: You can use `plt.plot` to draw the regression line. Have a look at the [documentation](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.plot.html) for how to use it. If you didn't do the bonus part of Task 1.1, you can have a look at the [`matplotlib` cheat sheet](https://matplotlib.org/cheatsheets/_images/cheatsheets-1.png).

*Hint 2*: You can easily find the x-axis values. How can you get the y-axis values for the regression line? You already did this in a previous task!

**Bonus**: Repeat this for height vs. name length.

# Task 2 - SQLAlchemy
We will be using [`sqlalchemy`](https://www.sqlalchemy.org/) here. First, we store the data from the DataFrame in a SQLite database.

You can find an `sqlalchemy` cheat sheet [here](https://www.pythonsheets.com/notes/python-sqlalchemy.html).

In [None]:
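import sqlite3

# Store the DataFrame in a local SQLite database file.
conn = sqlite3.connect('crowdstorming.db')
try:
    df.to_sql('crowdstorming', conn)
except ValueError:
    # to_sql raises ValueError if the table already exists,
    # e.g. when this cell is run a second time.
    pass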

from sqlalchemy import create_engine, MetaData

engine = create_engine('sqlite:///crowdstorming.db')
metadata = MetaData()
metadata.reflect(engine)

table_names = metadata.tables.keys()
print(table_names)

# Task 2.1 - SQL Query with SQLAlchemy
Write a query that returns the player's weight, height, and position using SQLAlchemy.

*Hint*: Check the cheat sheet for how to make queries.
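
The sketch below shows one possible shape of a column-selecting query, on a throwaway in-memory database. The table `fruits` and all `toy_*` names are hypothetical. It uses the `select(...)` construct with `session.execute`, which assumes SQLAlchemy 1.4 or newer; the cheat sheet may show the older `session.query(...)` style instead.

```python
from sqlalchemy import (
    create_engine, MetaData, Table, Column, Integer, String, select,
)
from sqlalchemy.orm import sessionmaker

# Throwaway in-memory SQLite database with a made-up table.
toy_engine = create_engine("sqlite://")
toy_metadata = MetaData()
fruits = Table(
    "fruits", toy_metadata,
    Column("id", Integer, primary_key=True),
    Column("name", String),
    Column("price", Integer),
)
toy_metadata.create_all(toy_engine)

with toy_engine.begin() as connection:
    connection.execute(fruits.insert(), [
        {"name": "apple", "price": 3},
        {"name": "pear", "price": 5},
    ])

ToySession = sessionmaker(bind=toy_engine)
with ToySession() as toy_session:
    # Select only two columns; table columns are reached through `.c`.
    stmt = select(fruits.c.name, fruits.c.price).order_by(fruits.c.id)
    rows = toy_session.execute(stmt).all()

print(rows)
```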

In [None]:
from sqlalchemy.orm import sessionmaker

# Get the table object
crowdstorming_table = metadata.tables['crowdstorming']

# Create a session
Session = sessionmaker(bind=engine)
session = Session()

# Task 3.1 - Loading additional data

To enrich our data we will collect information about the countries. For this we will use an API.

- Make a GET request to https://restcountries.com/v3.1/all. You can use the [`requests` library](https://requests.readthedocs.io/en/latest/user/quickstart/) for this.
- Create a DataFrame called `countries_df` from the response
- Alternative: Load the data from the file `countries.json`
- You may need either [`pd.DataFrame.from_records`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.from_records.html) or [`pd.read_json`](https://pandas.pydata.org/docs/reference/api/pandas.read_json.html).
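
To illustrate how a list of JSON objects becomes a DataFrame, here is a small sketch that avoids the network. The list `records` is a hand-written sample that only mimics the nesting described in Task 3.2; it stands in for the parsed response (for example what `response.json()` returns with `requests`) and is not real API output.

```python
import pandas as pd

# Hand-written sample, assumed to resemble the structure of the API response.
records = [
    {"name": {"common": "Switzerland", "official": "Swiss Confederation"},
     "fifa": "SUI"},
    {"name": {"common": "Germany", "official": "Federal Republic of Germany"},
     "fifa": "GER"},
]

# Each dictionary becomes one row; the top-level keys become the columns.
sample_df = pd.DataFrame.from_records(records)
print(sample_df.columns.tolist())

# Nested objects are kept as plain Python dictionaries inside the cells.
print(type(sample_df.loc[0, "name"]))
```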

# Task 3.2 - Data Cleaning
The 'name' column contains dictionaries. This makes it annoying for us to work with.
Simplify the column by replacing all entries in it with the value in 'common' in that dictionary.

*Hint*: You did something very similar in Task 1.2!

# Task 3.3 - Joining DataFrames

Combine the two DataFrames on the `leagueCountry` column. You can use [`pd.merge`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) for this.
For the DataFrame with the countries, you only need the `name` and `fifa` columns.
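
When the key columns have different names in the two DataFrames, `pd.merge` accepts `left_on` and `right_on`. The sketch below uses made-up DataFrames (`matches`, `codes`) with hypothetical column names; the choice of `how="left"` is an assumption made for the example, not a requirement of the task.

```python
import pandas as pd

# Toy data, not the course dataset.
matches = pd.DataFrame({
    "player": ["p1", "p2", "p3"],
    "country": ["Spain", "France", "Spain"],
})
codes = pd.DataFrame({
    "name": ["Spain", "France", "Italy"],
    "code": ["ESP", "FRA", "ITA"],
    "extra": [1, 2, 3],
})

# Keep only the needed columns of the right DataFrame before merging.
merged = pd.merge(
    matches,
    codes[["name", "code"]],
    left_on="country",
    right_on="name",
    how="left",  # keep every row of `matches`, even without a match
)
print(merged)
```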

# Task 4 - Joining with SQL

First we save the data from the DataFrame in the database.

In [None]:
try:
    countries_df[['name', 'fifa', 'unMember']].to_sql('countries', conn)
except ValueError:
    # The table already exists, e.g. because the cell was run before.
    pass

In [None]:
metadata = MetaData()
metadata.reflect(engine)
countries_table = metadata.tables['countries']
crowdstorming_table = metadata.tables['crowdstorming']

# Task 4.1 - Joining crowdstorming data and country data with SQL

Select all columns from the `crowdstorming` table, and only the `fifa` column from the `countries` table.
Then join the two tables on the `leagueCountry` column of the `crowdstorming` table and the `name` column of the `countries` table.
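
A possible shape for such a join, sketched on two made-up tables in an in-memory database. The tables `players` and `codes` and all `toy_*` names are hypothetical, and the sketch assumes SQLAlchemy 1.4 or newer.

```python
from sqlalchemy import (
    create_engine, MetaData, Table, Column, Integer, String, select,
)

toy_engine = create_engine("sqlite://")
toy_metadata = MetaData()
players = Table(
    "players", toy_metadata,
    Column("id", Integer, primary_key=True),
    Column("player", String),
    Column("country", String),
)
codes = Table(
    "codes", toy_metadata,
    Column("name", String, primary_key=True),
    Column("code", String),
)
toy_metadata.create_all(toy_engine)

with toy_engine.begin() as connection:
    connection.execute(players.insert(), [
        {"player": "p1", "country": "Spain"},
        {"player": "p2", "country": "France"},
    ])
    connection.execute(codes.insert(), [
        {"name": "Spain", "code": "ESP"},
        {"name": "France", "code": "FRA"},
    ])

# Passing a whole table to select() selects all of its columns;
# from the second table we pick a single column.
stmt = (
    select(players, codes.c.code)
    .select_from(players.join(codes, players.c.country == codes.c.name))
    .order_by(players.c.id)
)
with toy_engine.connect() as connection:
    joined = connection.execute(stmt).all()

print(joined)
```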


# Task 5 - Calculating the mean

Calculate the mean height and the mean weight of the players in the database, using SQLAlchemy.

*Hint*: You can use `func.avg` from `sqlalchemy` for this.
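
A minimal sketch of `func.avg` on a made-up table in an in-memory database. The table `readings` and the `toy_*` names are hypothetical; SQLAlchemy 1.4 or newer is assumed.

```python
from sqlalchemy import (
    create_engine, MetaData, Table, Column, Integer, Float, select, func,
)

toy_engine = create_engine("sqlite://")
toy_metadata = MetaData()
readings = Table(
    "readings", toy_metadata,
    Column("id", Integer, primary_key=True),
    Column("value", Float),
)
toy_metadata.create_all(toy_engine)

with toy_engine.begin() as connection:
    connection.execute(
        readings.insert(),
        [{"value": 10.0}, {"value": 20.0}, {"value": 30.0}],
    )

# func.avg(...) renders as the SQL aggregate AVG(...).
stmt = select(func.avg(readings.c.value))
with toy_engine.connect() as connection:
    mean_value = connection.execute(stmt).scalar()

print(mean_value)
```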

Now repeat this, but on the DataFrame. Are the results the same?

# Task 6 - Calculating the mean per position

Calculate the mean height and weight of the players for each position in the database, using SQLAlchemy.

*Hint*: You can use [`.group_by`](https://docs.sqlalchemy.org/en/20/core/selectable.html#sqlalchemy.sql.expression.Select.group_by) from `sqlalchemy` for this.
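
The sketch below combines `func.avg` with `.group_by` on a made-up table and compares the result with the equivalent `pandas` group-by. The table `items`, the columns `kind` and `value`, and the `toy_*` names are hypothetical; SQLAlchemy 1.4 or newer is assumed.

```python
import pandas as pd
from sqlalchemy import (
    create_engine, MetaData, Table, Column, Integer, Float, String,
    select, func,
)

rows = [
    {"kind": "a", "value": 1.0},
    {"kind": "a", "value": 3.0},
    {"kind": "b", "value": 10.0},
]

toy_engine = create_engine("sqlite://")
toy_metadata = MetaData()
items = Table(
    "items", toy_metadata,
    Column("id", Integer, primary_key=True),
    Column("kind", String),
    Column("value", Float),
)
toy_metadata.create_all(toy_engine)
with toy_engine.begin() as connection:
    connection.execute(items.insert(), rows)

# One output row per distinct value of `kind`, with the mean of `value`.
stmt = (
    select(items.c.kind, func.avg(items.c.value).label("mean_value"))
    .group_by(items.c.kind)
    .order_by(items.c.kind)
)
with toy_engine.connect() as connection:
    sql_means = {kind: mean for kind, mean in connection.execute(stmt)}

# The same aggregation on a DataFrame built from the same rows.
pandas_means = pd.DataFrame(rows).groupby("kind")["value"].mean().to_dict()

print(sql_means)
print(pandas_means)
```

Grouping by more than one column follows the same pattern: pass several columns to `.group_by`, or a list of column names to `groupby`.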

Now do the same with the DataFrame. Are the results the same?

# Task 7 - Calculating the mean per position and league
Calculate the mean height and weight of the players for each combination of position and league in the database, using SQLAlchemy.

Now do the same with the DataFrame. Are the results the same?