# Practising with Python

**Author:** Felipe Millacura

**Date:** 10th January 2021

## Learning Objectives

* Practise with the `pandas` and `numpy` libraries
* Be able to do basic data manipulation and work with missing values (NAs)


* Load in the dataset `starbucks_drinkMenu_expanded.csv`, calling it `drinks_content`, and then briefly explore it, printing out: the first 5 rows, last 5 rows, and then all the data. Find out the dimensions of the data (number of rows and columns) and the column names.

* We're going to be looking at the number of `Calories` in each drink.
Calculate some quick summary statistics (the mean and the variance), and plot a histogram to judge whether this variable looks roughly normally distributed.

* Check if you have any outliers in the `Calories` variable by creating a boxplot. (There is no need to change or remove any outliers you find.)

* Select the variables `Beverage_category`, `Beverage`, `Beverage_prep` and `Calories` from the `drinks_content` data frame, and assign the selected columns to a new data frame called `drinks`. Check if there are any `NaN` values anywhere in the data, and drop any rows containing them.

* Filter the data so we only take "Classic Espresso Drinks", and save this in a new data frame called `espresso_drinks`.

* Group your `espresso_drinks` data frame by the type of beverage prep, and then find the mean `Calories` for each group.

* Get the same grouped mean `Calories` values as above, but this time sorted in descending order.
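
The steps above could be sketched roughly as follows. This is one possible approach, not a model answer. The real CSV is not available here, so the sketch reads a tiny made-up table from a string with `io.StringIO`; every row and number in it is invented, and the column name `Beverage_prep` is an assumption about the real file. With the actual data you would pass the file name to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Made-up stand-in for starbucks_drinkMenu_expanded.csv (all values invented).
toy_csv = io.StringIO(
    "Beverage_category,Beverage,Beverage_prep,Calories\n"
    "Coffee,Brewed Coffee,Short,3\n"
    "Classic Espresso Drinks,Caffe Latte,Short Nonfat Milk,70\n"
    "Classic Espresso Drinks,Caffe Latte,2% Milk,100\n"
    "Classic Espresso Drinks,Cappuccino,2% Milk,80\n"
    "Classic Espresso Drinks,Cappuccino,Short Nonfat Milk,\n"
)
drinks_content = pd.read_csv(toy_csv)

# Brief exploration.
print(drinks_content.head())           # first 5 rows
print(drinks_content.tail())           # last 5 rows
print(drinks_content.shape)            # (number of rows, number of columns)
print(list(drinks_content.columns))    # column names

# Quick summary statistics (pandas skips NaN values by default).
print(drinks_content["Calories"].mean(), drinks_content["Calories"].var())

# Keep four columns, count missing values per column, then drop incomplete rows.
drinks = drinks_content[["Beverage_category", "Beverage", "Beverage_prep", "Calories"]]
print(drinks.isna().sum())
drinks = drinks.dropna()

# Keep only the classic espresso drinks.
espresso_drinks = drinks[drinks["Beverage_category"] == "Classic Espresso Drinks"]

# Mean calories per preparation type, then the same values in descending order.
mean_calories = espresso_drinks.groupby("Beverage_prep")["Calories"].mean()
mean_calories_sorted = mean_calories.sort_values(ascending=False)
print(mean_calories_sorted)
```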


# Analysis tasks

In the following analysis tasks, we are going to use a data set on incidences of forest fires in the north east of Portugal. You can find a description of the dataset [here](https://archive.ics.uci.edu/ml/datasets/forest+fires).

## Read data

* Import the `pandas` and `numpy` packages.

* Now read the file `forestfires.csv` and look at the first few rows.

* Have a look at the methods available on a `pandas` dataframe. 
  - We've already seen `describe()`, so run that on the dataframe you loaded above. 
  - Run another method providing general information on the stored data. 
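
As a rough illustration of the two kinds of overview methods, the sketch below builds a small invented data frame in place of `forestfires.csv` (the values are made up; only the column names follow the dataset description). `info()` is used here as one candidate for the "general information" method; other methods would also fit.

```python
import numpy as np
import pandas as pd

# Toy stand-in for pd.read_csv("forestfires.csv"); all values are invented.
fires = pd.DataFrame({
    "X": [7, 7, 8, 6],
    "Y": [5, 4, 6, 5],
    "month": ["mar", "oct", "aug", "aug"],
    "day": ["fri", "tue", "sun", "sun"],
    "temp": [8.2, 18.0, 22.8, np.nan],
    "rain": [0.0, 0.0, 0.2, 0.0],
    "area": [0.0, 0.0, 6.4, np.nan],
})

print(fires.head())          # first few rows

# describe() summarises the numeric columns: count, mean, std, min, quartiles, max.
summary = fires.describe()
print(summary)

# info() prints column dtypes, non-null counts and memory usage.
fires.info()
```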

## Missing data

* Which variables have missing data in this data frame?
  - Replace all the missing values in `area` with 0
  - Remove the rows that have missing values in other columns
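
A minimal sketch of one way to handle these steps, using an invented three-column frame rather than the real data (the values and the pattern of missing entries are made up):

```python
import numpy as np
import pandas as pd

# Toy data with deliberately missing entries; values are invented.
fires = pd.DataFrame({
    "temp": [8.2, np.nan, 22.8, 19.1],
    "rain": [0.0, 0.0, 0.2, 0.0],
    "area": [np.nan, 1.5, 6.4, np.nan],
})

# Number of missing values in each column.
missing_per_column = fires.isna().sum()
print(missing_per_column)

# Fill missing values in `area` only, then drop rows that still contain a NaN.
fires["area"] = fires["area"].fillna(0)
fires = fires.dropna()
print(fires)
```

The order matters here: filling `area` first means `dropna()` only removes rows with gaps in the other columns.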

## Using the "Verbs"

### Select

* Change your data frame so that columns `X` and `Y` are dropped.
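
One possible way to do this is `DataFrame.drop` with the `columns` argument, shown here on an invented toy frame:

```python
import pandas as pd

# Toy data; values are invented.
fires = pd.DataFrame({
    "X": [7, 8],
    "Y": [5, 6],
    "temp": [8.2, 22.8],
    "area": [0.0, 6.4],
})

# drop() returns a new data frame, so the result is assigned back.
fires = fires.drop(columns=["X", "Y"])
print(list(fires.columns))
```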

### Arrange

* Change your data frame so that it is arranged by `area`, with the largest fires first.
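
A short sketch of descending sorting with `sort_values`, on invented values:

```python
import pandas as pd

# Toy data; values are invented.
fires = pd.DataFrame({
    "month": ["mar", "aug", "sep"],
    "area": [0.0, 6.4, 2.1],
})

# ascending=False puts the largest `area` values first.
fires = fires.sort_values("area", ascending=False)
print(fires)
```

The original row labels are kept after sorting; chaining `.reset_index(drop=True)` would renumber them if that is preferred.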


### Filter

* Change your data frame so that it contains no rows where `area` is zero.
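
Filtering of this kind is commonly written with a boolean mask. A minimal sketch on invented values:

```python
import pandas as pd

# Toy data; values are invented.
fires = pd.DataFrame({
    "month": ["mar", "aug", "sep"],
    "area": [0.0, 6.4, 2.1],
})

# The comparison yields a True/False Series; indexing with it keeps the True rows.
fires = fires[fires["area"] != 0]
print(fires)
```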

### Mutate

* Create a new column `is_rain`, which is equal to `True` whenever `rain` is greater than zero and `False` otherwise.
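
Since a comparison on a column already produces `True`/`False` values, one way to sketch this is a direct assignment (toy values, invented for illustration):

```python
import pandas as pd

# Toy data; values are invented.
fires = pd.DataFrame({
    "rain": [0.0, 0.2, 0.0],
    "area": [1.5, 6.4, 2.1],
})

# Element-wise comparison gives a boolean column.
fires["is_rain"] = fires["rain"] > 0
print(fires)
```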

### Group by and summarise

* Find the average `area` of fire:
  - In each `month` of the year
  - In each `day` of the week
  - When there is rain, and when there isn't rain
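
The three summaries share the same pattern: group by one column, select `area`, take the mean. A sketch on invented data, with `is_rain` derived from `rain` inside the block so that it runs on its own:

```python
import pandas as pd

# Toy data; values are invented.
fires = pd.DataFrame({
    "month": ["aug", "aug", "sep", "sep"],
    "day": ["sun", "mon", "sun", "sun"],
    "rain": [0.0, 0.2, 0.0, 0.0],
    "area": [2.0, 4.0, 6.0, 10.0],
})
fires["is_rain"] = fires["rain"] > 0

area_by_month = fires.groupby("month")["area"].mean()
area_by_day = fires.groupby("day")["area"].mean()
area_by_rain = fires.groupby("is_rain")["area"].mean()

print(area_by_month)
print(area_by_day)
print(area_by_rain)
```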

## Plotting

* Use `matplotlib` to create a histogram of `area`.
* Use `seaborn` to create a histogram of `area`.
* Use `seaborn` to create a scatter plot of `temp` vs. `area`.

# Python Machine Learning - Homework

Use the forest fire data to build a linear regression model to predict the size of a forest fire. No need to explore the data first, since you did that already!

**Hint:** You are going to need `scikit-learn`. You can check the documentation [here](https://scikit-learn.org/stable/modules/linear_model.html).

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
```

**Hint 2:** Careful with NAs! You can transform categorical variables by using `pd.get_dummies()`. Check the documentation [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html).

**Hint 3:** For any machine learning algorithm you need to split your data into training and test sets. You can use `train_test_split` for that. Check out the documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).
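
To show how the three hints fit together, here is a minimal sketch on synthetic data. The data frame, the choice of predictors (`month`, `temp`, `wind`) and the linear relationship between `temp` and `area` are all invented for illustration; they say nothing about the real forest fire data. The `drop_first=True`, `test_size=0.25` and `random_state=42` settings are arbitrary choices as well.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the forest fire data (all values are generated).
rng = np.random.default_rng(0)
n = 60
fires = pd.DataFrame({
    "month": rng.choice(["aug", "sep", "mar"], size=n),
    "temp": rng.uniform(5, 30, size=n),
    "wind": rng.uniform(0, 9, size=n),
})
# Invented relationship so the model has something to learn.
fires["area"] = 0.5 * fires["temp"] + rng.normal(0, 1, size=n)

# Hint 2: remove NAs and turn the categorical column into 0/1 indicator columns.
fires = fires.dropna()
X = pd.get_dummies(
    fires.drop(columns="area"), columns=["month"], drop_first=True, dtype=float
)
y = fires["area"]

# Hint 3: hold out part of the data for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit on the training split only, then evaluate on the held-out split.
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print("R^2 on held-out data:", model.score(X_test, y_test))
```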