# Homework exercises 1

## Objective



In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import numpy as np 
import pandas as pd
from pathlib import Path

We set some default paths for our project and check that the file exists.

In [3]:
datadir = Path("data/raw/")
outputdir = Path("data/processed/")
filename = datadir / "homework1.csv"
filename.resolve(), filename.exists()

(WindowsPath('C:/Users/jan-willem.lankhaar/OneDrive - Stichting Hogeschool Utrecht/Documents/02 Cursussen/2022-2023/C - TMO-DME-20 Data Mining & Exploration/repo/course/homework1/data/raw/homework1.csv'),
 True)

In [4]:
df = pd.read_csv(filename)
df.head()

Unnamed: 0,x1,x2,name
0,4,0.683287,Python Regius
1,5,0.787097,Python Regius
2,7,,Python Regius
3,9,0.802364,Python Regius
4,0,,Python Regius


Let's check some summary statistics.

In [5]:
df.describe()

Unnamed: 0,x1,x2
count,100.0,82.0
mean,4.62,0.587542
std,2.834777,0.229508
min,0.0,0.203242
25%,2.0,0.377285
50%,4.5,0.632648
75%,7.0,0.786594
max,9.0,0.996141


The count for `x2` is 82 instead of 100, so `x2` contains 18 `NaN`s.
Let's select all columns with `NaN`s.

In [8]:
is_nan_columns = list(df.isna().sum() > 0)
is_nan_columns

[False, True, False]

Check if it works

In [10]:
df.columns[is_nan_columns]

Index(['x2'], dtype='object')
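
The same selection can be written without converting to a `list`, because a boolean Series can index `df.columns` directly. A minimal sketch on a small made-up frame (the values are illustrative, not the homework data; `demo`, `has_nan` and `nan_columns` are names chosen for this example):

```python
import numpy as np
import pandas as pd

# Illustrative data only; the column names mirror the homework file.
demo = pd.DataFrame(
    {
        "x1": [4, 5, 7],
        "x2": [0.68, np.nan, 0.80],
        "name": ["Python Regius", "Python Regius", "Python Regius"],
    }
)

# isna().any() yields one boolean per column: True if the column holds a NaN.
has_nan = demo.isna().any()

# Use the boolean Series as a mask on the column index.
nan_columns = list(demo.columns[has_nan])
print(nan_columns)
```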

Now drop the `NaN`s

In [11]:
df = df.dropna(subset=df.columns[is_nan_columns], axis="rows")
df

Unnamed: 0,x1,x2,name
0,4,0.683287,Python Regius
1,5,0.787097,Python Regius
3,9,0.802364,Python Regius
6,8,0.855227,Python Regius
7,9,0.861283,Python Regius
...,...,...,...
94,9,0.506065,Python Regius
95,7,0.785085,Python Regius
96,0,0.295006,Python Regius
97,1,0.768772,Python Regius
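
The number of rows removed by `dropna` can be verified by comparing the length before and after. A minimal sketch on a made-up frame (not the homework data; all names here are chosen for the example):

```python
import numpy as np
import pandas as pd

# Illustrative data: two of the four rows have a NaN in x2.
demo = pd.DataFrame({"x1": [1, 2, 3, 4], "x2": [0.1, np.nan, 0.3, np.nan]})

n_before = len(demo)
cleaned = demo.dropna(subset=["x2"])
n_dropped = n_before - len(cleaned)

# The number of dropped rows should equal the number of NaNs in the subset column.
print(f"dropped {n_dropped} of {n_before} rows")
```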


We dropped 18 rows.
Let's check the types:

In [65]:
df.dtypes

x1        int64
x2      float64
name     object
dtype: object

Let's clean out the name. We will use a regular expression to select the first word up to the first space. In a later lesson, we will study regular expressions in more detail.
Use https://regex101.com to create your own regular expressions.

In [12]:
import re

regex = re.compile(r"^\w+")        # One or more 'word' characters at the beginning of the string (raw string, so \w is not treated as a string escape).
out = re.search(regex, "Python Regius")
out.group()

'Python'

Let's put that into a function

In [69]:
def extract(regex, msg):
    out = re.search(regex, msg)
    return out.group()

And apply it

In [72]:
df["name"] = df["name"].apply(lambda x: extract(regex=regex, msg=x))
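
Note that `extract` raises an `AttributeError` when the pattern does not match, because `re.search` then returns `None`. A more defensive sketch with type hints, next to the vectorized `Series.str.extract` equivalent; the function name `extract_first_word` and the sample values are made up for this illustration:

```python
import re
from typing import Optional

import pandas as pd

FIRST_WORD = re.compile(r"^\w+")


def extract_first_word(msg: str) -> Optional[str]:
    """Return the leading word of msg, or None if msg does not start with one."""
    match = FIRST_WORD.search(msg)
    return match.group() if match else None


# Illustrative values; the last one starts with a space, so nothing matches.
names = pd.Series(["Python Regius", "Boa constrictor", " leading space"])

via_apply = names.apply(extract_first_word)

# str.extract needs a capture group; expand=False returns a Series.
via_str = names.str.extract(r"^(\w+)", expand=False)
```

Both variants yield a missing value for the non-matching row instead of raising an exception.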

In [107]:
df.head()

Unnamed: 0,x1,x2,name
0,4,0.683287,Python Regius
1,5,0.787097,Python Regius
3,9,0.802364,Python Regius
6,8,0.855227,Python Regius
7,9,0.861283,Python Regius


Looks good.

Now we save the file with a timestamp.

In [108]:
from datetime import datetime
tag = datetime.now().strftime("%Y%m%d-%H%M") + ".csv"
output = outputdir / tag
df.to_csv(output, index=False)
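
`to_csv` fails if the output folder does not exist yet. A small sketch of a helper that builds the timestamped path and creates the folder first; `timestamped_path` is a hypothetical name, and the fixed date and temporary directory are used only to keep the example reproducible:

```python
import tempfile
from datetime import datetime
from pathlib import Path


def timestamped_path(outputdir: Path, now: datetime, suffix: str = ".csv") -> Path:
    """Return outputdir/<YYYYmmdd-HHMM><suffix>, creating outputdir if needed."""
    outputdir.mkdir(parents=True, exist_ok=True)
    return outputdir / (now.strftime("%Y%m%d-%H%M") + suffix)


# Demonstration in a temporary directory with a fixed timestamp.
with tempfile.TemporaryDirectory() as tmp:
    output = timestamped_path(Path(tmp) / "data" / "processed", datetime(2023, 1, 2, 3, 4))
    folder_created = output.parent.is_dir()
    print(output.name)
```

In the notebook you would pass `outputdir` and `datetime.now()` instead.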

A lot of data scientists will stop here.

However, although the job is done, leaving things like this is risky.
Notebooks are for prototyping, not for creating a solid solution.

Now, have a look at the src folder.
Start at main.py, and also look at the other files.

Now go to the terminal and `cd` to the `les1` directory.
From there, run:

`poetry shell`

`python src/main.py --file=les1.csv`

Note how a `logging.log` file appears, and inspect its contents.

# Exercise

In the `data/raw` folder in the root directory DME22 you will find a `palmerpenguins.parq` file. This is an [Apache Parquet](https://parquet.apache.org/) file. If you always use CSV, you might [read this](https://bawaji94.medium.com/feather-vs-parquet-vs-csv-vs-jay-55206a9a09b0) so you know why that's not always a good idea.

### 1. Set up a project directory structure
1. Create a new folder (e.g. using `mkdir`) outside of the DME22 folder, named `cleanup`.
1. Initialize a new poetry environment.
1. `poetry add` the libraries you need.
1. Create a `data/raw` folder and copy `palmerpenguins.parq` to it (e.g. using the `cp` command).
1. Create `data/processed`, `src` and `notebook` folders.
1. Create a `src/main.py` file.

### 2. Create a prototype notebook
Make a notebook where you:
1. Load the file with `pandas`.
1. Check for `NaN`s.
1. Figure out how you can remove the `NaN`s.
1. Check very carefully how many rows you remove. Find a solution if the result doesn't look good.
1. Clean up the column with the names of the penguins. They are too long, so shorten them with a regular expression.
1. Save the cleaned file with a timestamp.

### 3. Implement the logic from the notebook in reproducible code
After you have created the prototype in a notebook, create an `__init__.py` file inside the `src` folder.
Turn the cleanup process into a script that can be run from the command line.
Use [click](https://click.palletsprojects.com/en/8.1.x/) to create easy arguments.
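
As a rough sketch of what a click entry point could look like: the option name `--file` mirrors the command shown earlier, while the function body and the message are placeholders for illustration. `CliRunner` is click's own test helper and is used here only to call the command without a terminal.

```python
import click
from click.testing import CliRunner


@click.command()
@click.option("--file", "filename", required=True, help="Name of the file to clean.")
def main(filename: str) -> None:
    """Placeholder entry point: a real script would load, clean and save the file."""
    click.echo(f"cleaning {filename}")


# Invoke the command the way `python src/main.py --file=les1.csv` would.
runner = CliRunner()
result = runner.invoke(main, ["--file=les1.csv"])
print(result.output)
```

In an actual `src/main.py` you would call `main()` under an `if __name__ == "__main__":` guard instead of using `CliRunner`.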

Try to add type hints.
Format your code with [black](https://github.com/psf/black) by running `black src` from the command line, where `src` is the folder you want to format. Make sure you have `cd`-ed to the project folder first, so that you see `src` when you run `ls`.

Add logging with loguru.

Don't hardcode any settings. Use pydantic with a settings.py file.
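
One possible shape for `settings.py`, sketched with a plain pydantic `BaseModel`. The field names and defaults are assumptions, taken from the paths and the pattern used in the notebook above.

```python
from pathlib import Path

from pydantic import BaseModel


class Settings(BaseModel):
    """Project settings collected in one place; values are validated on creation."""

    datadir: Path = Path("data/raw")
    outputdir: Path = Path("data/processed")
    name_pattern: str = r"^\w+"


# Defaults can be used as-is or overridden; strings are coerced to Path.
settings = Settings()
custom = Settings(datadir="other/raw")
print(settings.datadir, custom.datadir)
```

Other modules can then import `settings` instead of repeating paths and patterns.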



