# Data Pre-processing
## Session 3 - Unit 1 [Self Guided]

<img src=media/cleaning.jpg width=200/>

Welcome to the "Data Pre-processing" unit of the Python Academy! In this unit, you will learn about:
  - Data Quality
  - String methods and Regex
  - Date and Time Formatting
  - Dealing with Categoricals
  - Missing Values Imputation
  - Merging data

In [1]:
import re
import datetime
import pandas as pd
import numpy as np

## Data Quality

Data is (almost) never in the format we want and need. Data Scientists and Software Developers often find themselves investing more time than they would like to standardize, clean, add and/or drop data, and handle missing values, among other steps, just to make their data usable. We often refer to the unofficial mantra "Garbage In, Garbage Out": bad data fed into our applications produces bad outputs.

Data of bad quality can appear in many forms:
  - **incomplete**: missing data, not all values are observed;
  - **noisy**: values are observed but the measurements are not always exact;
  - **inconsistent**: data between columns does not match in logical terms.

With some intuition, you can expect how such data can affect the way we develop our applications. 
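To make the three forms of bad data more concrete, here is a minimal sketch on a toy table. The rows, the column names `year_released` and `top_year`, and the "plausible tempo" range of 40 to 250 bpm are all assumptions invented for illustration, not values taken from the Spotify dataset.

```python
import pandas as pd

# Hypothetical toy data, invented purely for illustration
songs = pd.DataFrame({
    "title": ["Song A", "Song B", None, "Song D"],   # incomplete: one title is missing
    "bpm": [120, 118, 95, 1200],                      # noisy: 1200 bpm is implausible
    "year_released": [2010, 2012, 2011, 2015],
    "top_year": [2010, 2011, 2011, 2015],             # inconsistent: charted before release
})

incomplete = songs["title"].isna()                        # value not observed
noisy = ~songs["bpm"].between(40, 250)                    # outside the assumed plausible range
inconsistent = songs["top_year"] < songs["year_released"] # columns contradict each other

# show every row that has at least one quality problem
print(songs[incomplete | noisy | inconsistent])
```

Each check yields a boolean mask, so the checks can be combined, counted or used to filter rows.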

*Quiz: What was the most absurd data pre-processing step you had to implement?*

In [14]:
SPOTIFY_FPATH = '../02_session/data/spotify_top100_2010_2019.csv'

Remember how we talked about **constants** and how we should use them for values that do not change while the program runs? Well, here they are. Since we will **use the file path** for our dataset **multiple times** in different code blocks, we **define it once in the beginning of the notebook**. Every time we want to load the data, we can re-use the file path by its name alone and avoid writing the full path.

## String Methods

Operations in Python are not just limited to numbers; we can also **manipulate and format text strings** with basic operations.

The first operation is **adding two strings** together, or as we eloquently like to call it, ***string concatenation***. We concatenate with the `+` plus sign. Since it doesn't make much sense to "add" each element of a string, we use the plus sign to attach one after the other instead. 

Another mathematically-inspired string operation is the ***string repetition***. Can you imagine what a repetition looks like in a calculator? Since you can't actually reply, we will just say it. It's **multiplication**, using the asterisk `*` sign. Since multiplying a number is repeatedly adding that same number, we can also repeat strings in Python.

In [7]:
print("my string" + " can be split" + " in multiple blocks," + " which I can" + " later concatenate. But if " + "I fail to add whitespaces in-between, " + "this"+"will"+"happen")

print("repeat after me: " + "Python! "  * 5)

my string can be split in multiple blocks, which I can later concatenate. But if I fail to add whitespaces in-between, thiswillhappen
repeat after me: Python! Python! Python! Python! Python! 


We may also want to **format values as strings**, for which we can use the built-in `str()` function. This is required, for example, to concatenate a number to a string.

In [4]:
print("What is the value of the golden ratio? " + str( 1/2 + 5**0.5/2))

What is the value of the golden ratio? 1.618033988749895
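To see why the conversion is needed, the small sketch below tries the concatenation without `str()` first and catches the resulting error. The variable names are ours, chosen only for this illustration.

```python
ratio = 1 / 2 + 5 ** 0.5 / 2

try:
    message = "ratio: " + ratio            # str + float is not allowed
except TypeError as err:
    message = f"TypeError: {err}"          # keep the error text instead of crashing

print(message)

converted = "ratio: " + str(ratio)         # works once the number is converted to a string
print(converted)
```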


From the multiple available ways to format strings, we advise the use of **[f-strings](https://peps.python.org/pep-0498/)**, which allow **literal string interpolation** and, as such, make their syntax quite intelligible at first glance. We **prefix** the strings with the letter `f` and encapsulate each expression we want to **evaluate within curly braces** `{}`. They also allow formatting specific types of values, like the float rounded to 5 decimal places that we use below. Notice the syntax?

In [8]:
print(f"What is the value of the golden ratio? {1/2 + 5**0.5/2:.5f}")

What is the value of the golden ratio? 1.61803
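The part after the colon is a format specification, and `.5f` is only one of many. A few more that tend to come in handy are sketched here. The values of `streams` and `share` are made up for the example.

```python
title = "I Need A Dollar"
bpm = 95
streams = 1234567            # hypothetical number, for illustration only
share = 0.4321               # hypothetical fraction, for illustration only

line_1 = f"{title!r} runs at {bpm} bpm"        # !r applies repr(), which adds the quotes
line_2 = f"{streams:,} streams"                # thousands separator
line_3 = f"{share:.1%} of listeners"           # percentage with 1 decimal place
line_4 = f"[{bpm:>6}] [{bpm:<6}] [{bpm:06d}]"  # right-align, left-align, zero-pad to width 6

for line in (line_1, line_2, line_3, line_4):
    print(line)
```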


A single string can also be formatted with **methods** called directly on the string we want to format. There are a lot of [string methods](https://docs.python.org/3/library/stdtypes.html#string-methods) available and you (probably) won't remember them all. So we advise you to just glance at all the available methods to get an idea for now. Later, when you find yourself at a crossroads, remember this cheatsheet and see how far you can go!

In [9]:
album = "hiatus kayote - tawk tomahawk"
print(album)
print(album.capitalize())                   # capitalize - first letter uppercase, rest lowercase
print(album.title())                        # title - title case, each word starts uppercase, rest lowercase
print(album.split(' - '))                   # split - list of substrings, split by a given separator
print(album.center(50, '.'))                # center - center within a given width, padded with a fill character

hiatus kayote - tawk tomahawk
Hiatus kayote - tawk tomahawk
Hiatus Kayote - Tawk Tomahawk
['hiatus kayote', 'tawk tomahawk']
..........hiatus kayote - tawk tomahawk...........
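A few more methods are particularly common when cleaning text: trimming whitespace, replacing substrings and joining pieces back together. The sketch below chains them on a slightly "dirty" version of the same album string. The extra whitespace and the variable names are our own additions.

```python
raw = "  hiatus kayote - tawk tomahawk \n"       # stray whitespace and a trailing newline

clean = raw.strip()                              # remove leading/trailing whitespace
artist, record = clean.split(" - ")              # unpack the two parts
relabeled = clean.replace(" - ", ": ")           # swap the separator
rejoined = " / ".join([artist.title(), record.title()])  # join a list back into one string
is_hiatus = clean.startswith("hiatus")           # boolean check on the prefix

print(repr(clean))
print(relabeled)
print(rejoined)
print(is_hiatus)
```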


On top of everything we just saw for formatting strings, `pandas` also supports string formatting with the `.str` accessor (more info [here](https://pandas.pydata.org/docs/user_guide/text.html)). But we will get there in a minute (or two).

## Regex

Regular expressions are **sequences of characters** that specify a **search pattern** in text. They allow us to search, search-and-replace or validate text inputs.

There are a lot of patterns available (see the [cheatsheet](https://regexr.com/) tables below) and Regex takes some time to master. It's quite normal for developers to refer back to interactive regex helpers (that make the expressions come to life) to check whether a search pattern is working as intended. Some useful helpers include [RegExr](https://regexr.com/) and [regex101](https://regex101.com/).

| Pattern      | Description                    | Type                     |
| ------------ | ------------------------------ | ------------------------ |
| `.`          | any character except newline   | Character classes        |
| `\w \d \s`   | word, digit, whitespace        | Character classes        |
| `\W \D \S`   | not word, digit, whitespace    | Character classes        |
| `[abc]`      | any of a, b, or c              | Character classes        |
| `[^abc]`     | not a, b, or c                 | Character classes        |
| `[a-g]`      | character between a & g        | Character classes        |
| `^abc$`      | start / end of the string      | Anchors                  |
| `\b \B`      | word, not-word boundary        | Anchors                  |

| Pattern      | Description                    | Type                     |
| ------------ | ------------------------------ | ------------------------ |
| `\. \* \\`   | escaped special characters     | Escaped characters       |
| `\t \n \r`   | tab, linefeed, carriage return | Escaped characters       |
| `(abc)`      | capture group                  | Groups & Lookaround      |
| `\1`         | backreference to group #1      | Groups & Lookaround      |
| `(?:abc)`    | non-capturing group            | Groups & Lookaround      |
| `(?=abc)`    | positive lookahead             | Groups & Lookaround      |
| `(?!abc)`    | negative lookahead             | Groups & Lookaround      |
| `a* a+ a?`   | 0 or more, 1 or more, 0 or 1   | Quantifiers & Alternation |
| `a{5} a{2,}` | exactly five, two or more      | Quantifiers & Alternation |
| `a{1,3}`     | between one & three            | Quantifiers & Alternation |
| `a+? a{2,}?` | match as few as possible       | Quantifiers & Alternation |
| `ab\|cd`     | match ab or cd                 | Quantifiers & Alternation |

In the Python standard library, the `re` module provides the matching operations we need to work with Regex.

Regex works by **combining patterns** (things we want to find) **and texts** (where we want to find such things).

In [19]:
pattern = r'[A-Z][a-z]+'                # capital letter, followed by 1 or more lowercase letters
text = 'Michael league bill laurance Shaun martin Larnell Lewis'
re.findall(pattern, text)

['Michael', 'Shaun', 'Larnell', 'Lewis']
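`re.findall` is only one of the functions in the module. As a rough sketch of the other use cases mentioned earlier (search and search-and-replace), the example below reuses the same text with a different, hypothetical pattern that targets the words that were *not* capitalized.

```python
import re

text = 'Michael league bill laurance Shaun martin Larnell Lewis'
lowercase_word = re.compile(r'\b[a-z]+\b')     # whole words made of lowercase letters only

first = lowercase_word.search(text)            # first match object, or None if nothing matches
lowercase_words = lowercase_word.findall(text) # every match, as a list of strings

# search-and-replace: the function receives each match and returns its replacement
fixed = lowercase_word.sub(lambda m: m.group().capitalize(), text)

print(first.group())
print(lowercase_words)
print(fixed)
```

Compiling the pattern once with `re.compile` is optional. It mainly saves retyping the pattern when it is used several times.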

Alternatively, we can also use pandas to match text columns by regex patterns. This is available as a string accessor `.str` on a Series (e.g. DataFrame's column). We have multiple ways to interact with our search patterns on DataFrames (see table below).

| Method   | Description                                                    |
| -------- | -------------------------------------------------------------- |
| replace  | replace the search pattern with given value                    |
| contains | boolean whether search pattern is contained in target string   |
| extract  | extract and return capture groups of search pattern as columns |
| findall  | find all occurrences of search pattern similar to `re.findall` |
| match    | whether string matches search pattern                          |
| split    | string split, equivalent to `str.split()`                      |
| rsplit   | right string split, equivalent to `str.rsplit()`               |
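As an illustrative sketch of a few of these methods, the example below applies them to a tiny Series of song titles copied from the dataset preview. The patterns themselves are our own and are only one way of writing them.

```python
import pandas as pd

titles = pd.Series([
    "STARSTRUKK (feat. Katy Perry)",
    "I Need A Dollar",
    "Nothin' on You (feat. Bruno Mars)",
])

has_feat = titles.str.contains(r"\(feat\.", regex=True)                # boolean per title
base_title = titles.str.replace(r"\s*\(feat\..*\)$", "", regex=True)   # drop the "(feat. ...)" suffix
first_word = titles.str.extract(r"^(\w+)", expand=False)               # first capture group as a Series
starts_upper = titles.str.match(r"[A-Z]{2,}")                          # anchored at the start of each string
words = titles.str.split(" ")                                          # list of words per title

print(pd.DataFrame({"title": titles, "has_feat": has_feat, "base_title": base_title}))
```

Note that `match` only looks at the beginning of each string, whereas `contains` searches anywhere in it.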


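The function below flags titles that mention a featured artist and extracts that artist's name. Its pattern, `.*feat.(.*)\)`, reads from left to right as: any characters (`.*`), the literal text `feat`, one more character of any kind (`.`, here the period in "feat."), a capture group holding everything that follows (`(.*)`), and a literal closing parenthesis (`\)`).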

In [4]:
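def featuring_artists(fpath: str) -> pd.DataFrame:
    df = pd.read_csv(fpath)
    # flag titles that contain the text 'feat'
    df['feat'] = df['title'].str.contains(pat='feat')
    # raw string, so that \) reaches the regex engine as an escaped parenthesis
    df['feat_artist'] = df['title'].str.extract(pat=r'.*feat.(.*)\)', expand=False)
    df = df[['title', 'artist', 'feat', 'feat_artist']]
    return df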

In [5]:
featuring_artists('data/spotify_1.csv').head()


Unnamed: 0,title,artist,feat,feat_artist
0,STARSTRUKK (feat. Katy Perry),3OH!3,True,Katy Perry
1,My First Kiss (feat. Ke$ha),3OH!3,True,Ke$ha
2,I Need A Dollar,Aloe Blacc,False,
3,Airplanes (feat. Hayley Williams of Paramore),B.o.B,True,Hayley Williams of Paramore
4,Nothin' on You (feat. Bruno Mars),B.o.B,True,Bruno Mars
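To check what the pattern captures, it can be tried with the plain `re` module on a couple of titles from the table. The sketch below also compares it with a stricter variant. `STRICTER` and the helper `first_group` are hypothetical names of ours and are not part of the notebook's function.

```python
import re

ORIGINAL = r'.*feat.(.*)\)'          # pattern used in featuring_artists
STRICTER = r'\(feat\.\s*(.*)\)'      # hypothetical alternative: literal "(feat.", then optional spaces

def first_group(pattern, title):
    """Return capture group 1 of the first match, or None when nothing matches."""
    found = re.search(pattern, title)
    return found.group(1) if found else None

titles = ["My First Kiss (feat. Ke$ha)", "The Time (Dirty Bit)"]

original_hits = [first_group(ORIGINAL, t) for t in titles]
stricter_hits = [first_group(STRICTER, t) for t in titles]

print(original_hits)   # the capture keeps the space that follows "feat."
print(stricter_hits)
```

With the original pattern, the single `.` consumes only the period, so the captured name appears to start with a space. If that matters, the stricter variant or a later `.str.strip()` would remove it. Titles without "feat" produce no match at all, which pandas shows as a missing value.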


## Merging data

Merging data is a common data manipulation task in data science, analysis and engineering. It involves combining data from multiple sources based on common columns or indexes. Pandas provides several functions for merging dataframes, including **merge**, **join**, and **concat**.

The `merge` function combines DataFrames based on one or more common columns (or on their indexes), while the `join` method combines them on their index by default (a column of the calling DataFrame can be used instead through its `on` parameter). The `concat` function, on the other hand, stacks DataFrames vertically (`axis=0`) or horizontally (`axis=1`).


### Merging

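Both `merge` and `join` perform database-style joins. `merge` matches rows on one or more columns (here `title`), whereas `join` matches rows on the index by default, so using it here would first require setting `title` as the index of both DataFrames.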

In [6]:
# build two DataFrames that share the 'title' column

songs = featuring_artists('data/spotify_1.csv')
song_artist = songs[['title', 'artist']]
song_feat = songs[['title', 'feat_artist']]

In [35]:
# merging on song title

song_artist.merge(
    song_feat,
    how='inner',
    on='title',
)

Unnamed: 0,title,artist,feat_artist
0,STARSTRUKK (feat. Katy Perry),3OH!3,Katy Perry
1,My First Kiss (feat. Ke$ha),3OH!3,Ke$ha
2,I Need A Dollar,Aloe Blacc,
3,Airplanes (feat. Hayley Williams of Paramore),B.o.B,Hayley Williams of Paramore
4,Nothin' on You (feat. Bruno Mars),B.o.B,Bruno Mars
5,Magic (feat. Rivers Cuomo),B.o.B,Rivers Cuomo
6,The Time (Dirty Bit),Black Eyed Peas,
7,Imma Be,Black Eyed Peas,
8,Talking to the Moon,Bruno Mars,
9,Just the Way You Are,Bruno Mars,


### Concat

In [41]:
# store first 5 and last 5 rows in two different dataframes

first5_songs = featuring_artists('data/spotify_1.csv').head()
last5_songs =  featuring_artists('data/spotify_1.csv').tail()

In [44]:
ten_songs = pd.concat([first5_songs, last5_songs], axis=0)
ten_songs

Unnamed: 0,title,artist,feat,feat_artist
0,STARSTRUKK (feat. Katy Perry),3OH!3,True,Katy Perry
1,My First Kiss (feat. Ke$ha),3OH!3,True,Ke$ha
2,I Need A Dollar,Aloe Blacc,False,
3,Airplanes (feat. Hayley Williams of Paramore),B.o.B,True,Hayley Williams of Paramore
4,Nothin' on You (feat. Bruno Mars),B.o.B,True,Bruno Mars
5,Magic (feat. Rivers Cuomo),B.o.B,True,Rivers Cuomo
6,The Time (Dirty Bit),Black Eyed Peas,False,
7,Imma Be,Black Eyed Peas,False,
8,Talking to the Moon,Bruno Mars,False,
9,Just the Way You Are,Bruno Mars,False,
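Two details of `concat` are easy to miss: vertical stacking keeps each DataFrame's original index labels unless told otherwise, and `axis=1` places the DataFrames side by side, aligned on their index. A minimal sketch with invented placeholder rows:

```python
import pandas as pd

# placeholder data, invented for illustration
first = pd.DataFrame({"title": ["A", "B"], "artist": ["x", "y"]})
last = pd.DataFrame({"title": ["Y", "Z"], "artist": ["p", "q"]}, index=[98, 99])

stacked = pd.concat([first, last], axis=0)                        # index labels are kept as they were
renumbered = pd.concat([first, last], axis=0, ignore_index=True)  # index is rebuilt from 0

# side by side: reset the index first so that both frames align row by row
side_by_side = pd.concat([first, last.reset_index(drop=True)], axis=1)

print(stacked.index.tolist())
print(renumbered.index.tolist())
print(side_by_side)
```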


## Missing Values

In Data Science and Software Development, something being *missing* or *absent of value* is often treated in the same way and broadly referred to as **missing values**. Missing values can also be referred to as NA (not available) or NaN (not a number).

In Python, we can define a missing value with `None`, a reserved keyword that symbolizes a null value (absent) or no value at all (missing). See the difference?
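A short sketch of how `None` and NaN behave in practice may help here. It relies only on standard Python, NumPy and pandas behavior, and the Series is a made-up example.

```python
import numpy as np
import pandas as pd

value = None
print(value is None)            # None is a single object, so we test for it with `is`

nan_equals_itself = np.nan == np.nan
print(nan_equals_itself)        # NaN is a float that never compares equal, not even to itself

numbers = pd.Series([1, None, 3])
print(numbers.dtype)            # pandas stores the None as NaN in a float column
print(numbers.isna().tolist())  # isna() recognizes the missing entry
```

Because NaN does not compare equal to anything, missing values are detected with `.isna()` rather than with `==`.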

With `pandas`, support for [missing data](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html) is much more flexible and can even depend on the type of data you are working with (e.g. NaN for numericals, NaT for datetime). Some of the most useful functionalities include:
  - `.isna()` creates a **mask of booleans identifying missing** values;
  - `.fillna()` **replaces missing values** with a given non-NA value;
  - `.dropna()` **excludes data** (rows and/or columns) that include missing values.
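The three methods can be tried on a tiny hand-made DataFrame before applying them to the real data. All values below are invented for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "title": ["Song A", None, "Song C"],
    "artist": ["Artist 1", "Artist 2", "Artist 3"],
    "bpm": [120.0, 95.0, np.nan],
})

mask = toy.isna()                            # same shape as toy, True where a value is missing
per_column = mask.sum()                      # number of missing values in each column

filled = toy.fillna({"title": "???"})        # fill only the 'title' column
complete_rows = toy.dropna()                 # keep rows without any missing value
with_title = toy.dropna(subset=["title"])    # only require 'title' to be present
complete_cols = toy.dropna(axis=1)           # keep columns without any missing value

print(per_column)
print(filled)
print(complete_rows)
```

Passing a dictionary to `fillna` lets each column receive its own replacement, and `subset` limits `dropna` to the columns that really must be present.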


Let's look at our data again and simulate some missingness. Afterwards, we can try out the new tools we just saw to fix it!

In [15]:
def missing_data(fpath: str, pct_missing: float = 0.2) -> pd.DataFrame:
    df = pd.read_csv(fpath)                                     # read data
    df = df.mask(np.random.rand(*df.shape) < pct_missing)       # blank out roughly pct_missing of the cells at random
    df = df.iloc[:, :3]                                         # get first 3 columns only
    return df

miss_df = missing_data(SPOTIFY_FPATH)

In [18]:
df = pd.read_csv(SPOTIFY_FPATH)
df.head()

Unnamed: 0,title,artist,top genre,year released,added,bpm,nrgy,dnce,dB,live,val,dur,acous,spch,pop,top year,artist type
0,STARSTRUKK (feat. Katy Perry),3OH!3,dance pop,2009,2022-02-17,140,81,61,-6,23,23,203,0,6,70,2010,Duo
1,My First Kiss (feat. Ke$ha),3OH!3,dance pop,2010,2022-02-17,138,89,68,-4,36,83,192,1,8,68,2010,Duo
2,I Need A Dollar,Aloe Blacc,pop soul,2010,2022-02-17,95,48,84,-7,9,96,243,20,3,72,2010,Solo
3,Airplanes (feat. Hayley Williams of Paramore),B.o.B,atl hip hop,2010,2022-02-17,93,87,66,-4,4,38,180,11,12,80,2010,Solo
4,Nothin' on You (feat. Bruno Mars),B.o.B,atl hip hop,2010,2022-02-17,104,85,69,-6,9,74,268,39,5,79,2010,Solo


In [17]:
miss_df.head()

Unnamed: 0,title,artist,top genre
0,,3OH!3,dance pop
1,My First Kiss (feat. Ke$ha),3OH!3,
2,I Need A Dollar,,pop soul
3,Airplanes (feat. Hayley Williams of Paramore),B.o.B,
4,Nothin' on You (feat. Bruno Mars),B.o.B,


In [16]:
# number of missings per column
miss_df.isna().sum() 

title        189
artist       185
top genre    201
dtype: int64

In [28]:
# replace the missings by a symbolic ??? mark
miss_df.fillna('???').head()

Unnamed: 0,title,artist,top genre
0,???,3OH!3,dance pop
1,???,3OH!3,???
2,I Need A Dollar,Aloe Blacc,pop soul
3,Airplanes (feat. Hayley Williams of Paramore),???,atl hip hop
4,???,B.o.B,atl hip hop
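A placeholder such as `???` keeps every row but adds no information. A common alternative is to *impute* a plausible value instead, for example the most frequent category or the median of a numeric column. The sketch below shows one way this could look on a small invented table. The column choices and strategies are assumptions for illustration, not a recommendation for this dataset.

```python
import numpy as np
import pandas as pd

# toy data, loosely modelled on the columns of the dataset
toy = pd.DataFrame({
    "artist": ["3OH!3", "3OH!3", "B.o.B", None],
    "top genre": ["dance pop", None, "atl hip hop", "dance pop"],
    "bpm": [140.0, np.nan, 93.0, 104.0],
})

imputed = toy.copy()                                         # leave the original untouched
most_common_genre = imputed["top genre"].mode()[0]           # most frequent category
imputed["top genre"] = imputed["top genre"].fillna(most_common_genre)
imputed["bpm"] = imputed["bpm"].fillna(imputed["bpm"].median())  # median of the observed values
imputed["artist"] = imputed["artist"].fillna("unknown")      # explicit label when no guess is sensible

print(imputed)
```

Whichever strategy is chosen, imputed values are guesses, so it is worth keeping track of which cells were originally missing.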


## Recap

Congratulations, you made it all the way through the "Data Pre-processing" unit! We covered a LOT of different parts of data cleaning and your head must be spinning right now, but rest assured these topics will turn out to be useful in the future and you will be able to learn them on a deeper level. For now, let's just recap and appreciate the ride we just took. By the end of this notebook, you should have a clear idea of:
  1. Why Data Cleaning is an unwanted, ever-present aspect of developing software;
  2. How string methods can help you with text data (incl. Regex and search patterns);
  3. Merging and concatenating data;
  4. Missing Values.