# Introduction to Data Science and Machine Learning

<p align="center">
    <img width="699" alt="image" src="https://user-images.githubusercontent.com/49638680/159042792-8510fbd1-c4ac-4a48-8320-bc6c1a49cdae.png">
</p>

## Pandas
Here we present a tool that will stay in your data science toolbox for a long time.

The [pandas library](https://pandas.pydata.org/) defines and makes use of a new _data structure_, _i.e._ the `DataFrame`. 
Actually, pandas defines more than one data structure: we will also make use of `Series` and examine how it differs from a dataframe.

### Advantages of pandas

Data scientists use pandas for the following advantages:

* It easily handles missing data;
* It provides an efficient way to slice the data;
* It provides a flexible way to merge, concatenate or reshape the data;
* It includes a powerful data casting tool;
* It wraps data visualisation libraries in order to quickly plot analysis results.

As you can imagine it is not all fun and games: the main disadvantage of pandas is that it can only manipulate dataframes that fit in memory. 
For bigger-than-memory datasets we need other libraries (`dask`, `duckdb`, `pyspark`, etc.). However, this is far beyond the scope of this course, so we will focus on manageable dataframes for now (which, given the memory of modern computers, can still be quite large).

### Dataframes

A `DataFrame`, roughly speaking, is a table. Like any other type in Python, it is defined as an object, with its attributes and methods. 
I strongly advise having a look at the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html).

More formally, it is a rank-$2$ array, with axes labelled as _rows_ and _columns_. It is the basic object in pandas and a really common way to load data in memory in order to operate on them.
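
The two axes can be inspected directly on any dataframe. The following is a minimal sketch on a hypothetical two-row table (`df_demo` is an illustrative name, not part of the course datasets):

```python
import pandas as pd

# Hypothetical two-row table, used only for illustration.
df_demo = pd.DataFrame({'Name': ['peter', 'oscar'], 'Age': [25, 30]})

print(df_demo.ndim)           # number of axes
print(df_demo.shape)          # (number of rows, number of columns)
print(list(df_demo.index))    # row labels
print(list(df_demo.columns))  # column labels
```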

Now the question you are all asking: how do we _create a dataframe_? There are several ways: from tuples, lists, numpy arrays or even dictionaries. 
As a first example, consider a list of people's names and their ages; you can create a dataframe in this way:

In [1]:
import pandas as pd # pd is a standard alias for pandas library.

# List of lists made by [str, int]
lst = [['peter', 25], ['oscar', 30],
       ['tony', 26], ['bruce', 22]]

df = pd.DataFrame(lst, columns=['Name', 'Age'])
df

Unnamed: 0,Name,Age
0,peter,25
1,oscar,30
2,tony,26
3,bruce,22


In [2]:
# Dict made by {str: int}
data_dict = {'peter': 25, 'oscar': 30, 'tony': 26, 'bruce': 22}

df = pd.DataFrame(data=data_dict.items(), columns=['Name', 'Age'])
df

Unnamed: 0,Name,Age
0,peter,25
1,oscar,30
2,tony,26
3,bruce,22
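
The other construction routes mentioned above (tuples, numpy arrays, dictionaries of columns) follow the same pattern. A short sketch, where all variable names and values are made up for illustration:

```python
import numpy as np
import pandas as pd

# Dict mapping each column name to its list of values
df_cols = pd.DataFrame({'Name': ['peter', 'oscar'], 'Age': [25, 30]})

# Rank-2 numpy array: one row per record, column names given explicitly
arr = np.array([[1.0, 2.0], [3.0, 4.0]])
df_arr = pd.DataFrame(arr, columns=['x', 'y'])

# List of tuples, one tuple per row
df_tup = pd.DataFrame([('tony', 26), ('bruce', 22)], columns=['Name', 'Age'])

print(df_cols)
print(df_arr)
print(df_tup)
```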


In [3]:
# Read from csv file
df = pd.read_csv('datasets/people.csv', header=None, names=['Name', 'Age'])
df 

Unnamed: 0,Name,Age
0,'peter',25
1,'oscar',30
2,'tony',26
3,'bruce',22


As you can see, we get the same kind of object each time. Once data are organised in a dataframe, no matter how we imported them, they are stored in an object that always has the same methods and attributes.
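
One way to check that two construction routes give the same table is the `equals` method. The sketch below rebuilds the list and dictionary examples; the dictionary items are wrapped in `list(...)` here as a precaution, which is a choice of this example rather than something the cells above require:

```python
import pandas as pd

lst = [['peter', 25], ['oscar', 30], ['tony', 26], ['bruce', 22]]
data_dict = {'peter': 25, 'oscar': 30, 'tony': 26, 'bruce': 22}

df_from_list = pd.DataFrame(lst, columns=['Name', 'Age'])
df_from_dict = pd.DataFrame(list(data_dict.items()), columns=['Name', 'Age'])

# equals checks that both frames hold the same values with the same column dtypes
same = df_from_list.equals(df_from_dict)
print(same)
```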

### Series

A `Series` is a one-dimensional data structure. It can hold any data type, like integer, float, and string, or even composite ones like lists, dictionaries, etc. 

It is useful when you want to perform a computation on, or return, a one-dimensional array. A series, by definition, cannot have multiple columns. For the latter case, use the dataframe structure, which can indeed be considered as made up of series.

The main parameter of `Series` is the data, which can be a list, a dictionary, or a scalar value:

In [4]:
pd.Series([1., 2., 3.])

0    1.0
1    2.0
2    3.0
dtype: float64
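
The dictionary and scalar cases can be sketched as follows. The `index` parameter used for the scalar case is part of the `Series` constructor but is not discussed in the text above; the names and values are illustrative:

```python
import pandas as pd

# From a dictionary: keys become the index labels
ages = pd.Series({'peter': 25, 'oscar': 30})

# From a scalar: the value is repeated for every label in the index
ones = pd.Series(1.0, index=['a', 'b', 'c'])

# A single dataframe column is itself a Series
df_demo = pd.DataFrame({'Name': ['peter', 'oscar'], 'Age': [25, 30]})
age_col = df_demo['Age']
print(type(age_col).__name__)
```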

### Read from files

We have already seen an example of importing data stored in a file into a dataframe. Now we want to go into more detail.

Data can be loaded into a DataFrame from different data formats, like csv, xlsx, json, parquet, etc.

For example, you have already met my old friend, the `read_csv` function.

In [5]:
df = pd.read_csv('datasets/pandas_tutorial_read.csv', delimiter=';')
df

Unnamed: 0,2018-01-01 00:01:01,read,country_7,2458151261,SEO,North America
0,2018-01-01 00:03:20,read,country_7,2458151262,SEO,South America
1,2018-01-01 00:04:01,read,country_7,2458151263,AdWords,Africa
2,2018-01-01 00:04:02,read,country_7,2458151264,AdWords,Europe
3,2018-01-01 00:05:03,read,country_8,2458151265,Reddit,North America
4,2018-01-01 00:05:42,read,country_6,2458151266,Reddit,North America
...,...,...,...,...,...,...
1789,2018-01-01 23:57:14,read,country_2,2458153051,AdWords,North America
1790,2018-01-01 23:58:33,read,country_8,2458153052,SEO,Asia
1791,2018-01-01 23:59:36,read,country_6,2458153053,Reddit,Asia
1792,2018-01-01 23:59:36,read,country_7,2458153054,AdWords,Europe


This dataset holds the data of a travel blog.

Note a default behaviour of pandas `read_csv`: the csv file does not have a header row, therefore pandas used the first row of data as the header. In order to set the names of the columns you can use the `names` parameter.

In [31]:
df = pd.read_csv('datasets/pandas_tutorial_read.csv', delimiter=';',
                 names=['my_datetime', 'event', 'country', 'user_id', 'source', 'topic'],
                 parse_dates=True)
df

Unnamed: 0,my_datetime,event,country,user_id,source,topic
0,2018-01-01 00:01:01,read,country_7,2458151261,SEO,North America
1,2018-01-01 00:03:20,read,country_7,2458151262,SEO,South America
2,2018-01-01 00:04:01,read,country_7,2458151263,AdWords,Africa
3,2018-01-01 00:04:02,read,country_7,2458151264,AdWords,Europe
4,2018-01-01 00:05:03,read,country_8,2458151265,Reddit,North America
...,...,...,...,...,...,...
1790,2018-01-01 23:57:14,read,country_2,2458153051,AdWords,North America
1791,2018-01-01 23:58:33,read,country_8,2458153052,SEO,Asia
1792,2018-01-01 23:59:36,read,country_6,2458153053,Reddit,Asia
1793,2018-01-01 23:59:36,read,country_7,2458153054,AdWords,Europe
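
The header behaviour can be reproduced on a small in-memory file. The sketch below uses a hypothetical two-line csv held in a string. It also shows date parsing by column name: as far as the pandas documentation states, `parse_dates=True` only tries to parse the index, which would explain why `my_datetime` is still reported as `object` by `info` further down.

```python
import io
import pandas as pd

# Hypothetical two-line file without a header row, kept in memory
raw = "2018-01-01 00:01:01;read\n2018-01-01 00:03:20;read\n"

# Default: the first data row is consumed as the header
df_default = pd.read_csv(io.StringIO(raw), delimiter=';')

# names supplies the column labels, so no data row is lost
df_named = pd.read_csv(io.StringIO(raw), delimiter=';',
                       names=['my_datetime', 'event'])

# Listing a column in parse_dates converts it to datetime
df_dates = pd.read_csv(io.StringIO(raw), delimiter=';',
                       names=['my_datetime', 'event'],
                       parse_dates=['my_datetime'])

print(len(df_default), len(df_named))
print(df_dates.dtypes)
```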


Sometimes it is handy not to print the whole dataframe and flood your screen with data. 
When a few lines are enough, you can print only the first $n$ lines by typing:

```python
df.head(n)
```

If you leave the $n$ parameter blank, the method takes the default value, which is $5$.


In [9]:
df.head()

Unnamed: 0,my_datetime,event,country,user_id,source,topic
0,2018-01-01 00:01:01,read,country_7,2458151261,SEO,North America
1,2018-01-01 00:03:20,read,country_7,2458151262,SEO,South America
2,2018-01-01 00:04:01,read,country_7,2458151263,AdWords,Africa
3,2018-01-01 00:04:02,read,country_7,2458151264,AdWords,Europe
4,2018-01-01 00:05:03,read,country_8,2458151265,Reddit,North America


By symmetry, you can imagine what the `tail` method returns.

In [10]:
df.tail()

Unnamed: 0,my_datetime,event,country,user_id,source,topic
1790,2018-01-01 23:57:14,read,country_2,2458153051,AdWords,North America
1791,2018-01-01 23:58:33,read,country_8,2458153052,SEO,Asia
1792,2018-01-01 23:59:36,read,country_6,2458153053,Reddit,Asia
1793,2018-01-01 23:59:36,read,country_7,2458153054,AdWords,Europe
1794,2018-01-01 23:59:38,read,country_5,2458153055,Reddit,Asia
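
Both methods accept the number of rows as an argument. A minimal sketch on a hypothetical ten-row frame:

```python
import pandas as pd

# Hypothetical ten-row frame
df_demo = pd.DataFrame({'value': range(10)})

first_three = df_demo.head(3)   # rows 0, 1, 2
last_two = df_demo.tail(2)      # rows 8, 9
default_head = df_demo.head()   # n defaults to 5

print(first_three)
print(last_two)
```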


We might also need a random sample of $k$ lines out of the dataframe; this can be achieved with the `sample` method.

In [11]:
df.sample(7)

Unnamed: 0,my_datetime,event,country,user_id,source,topic
856,2018-01-01 11:35:12,read,country_5,2458152117,SEO,Australia
1489,2018-01-01 20:15:14,read,country_5,2458152750,SEO,North America
801,2018-01-01 10:56:30,read,country_2,2458152062,Reddit,Asia
922,2018-01-01 12:19:21,read,country_5,2458152183,SEO,North America
106,2018-01-01 01:35:40,read,country_5,2458151367,SEO,North America
1279,2018-01-01 17:17:37,read,country_1,2458152540,AdWords,Europe
125,2018-01-01 01:51:51,read,country_2,2458151386,SEO,South America
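
Since the rows are drawn at random, every execution returns a different selection. When reproducibility matters, the `random_state` parameter fixes the seed, and `frac` draws a fraction of the rows instead of a fixed count. A sketch on a hypothetical frame:

```python
import pandas as pd

df_demo = pd.DataFrame({'value': range(100)})

# random_state fixes the seed, so both calls pick the same rows
s1 = df_demo.sample(n=7, random_state=42)
s2 = df_demo.sample(n=7, random_state=42)

# frac draws a fraction of the rows instead of a fixed count
tenth = df_demo.sample(frac=0.1, random_state=42)

print(s1.equals(s2))
print(len(tenth))
```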


Two other dataframe methods that are very useful in analysing data are `describe` and `info`.

The `describe` method gives us some statistical information about our data.

In [12]:
df.describe()

Unnamed: 0,user_id
count,1795.0
mean,2458152000.0
std,518.3162
min,2458151000.0
25%,2458152000.0
50%,2458152000.0
75%,2458153000.0
max,2458153000.0


Note how the result is again a dataframe (try to execute `print(df.describe())`) whose index is a list of statistical properties, and whose columns hold the values of those properties for the columns of the starting dataframe.
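
Because the result is a dataframe, single statistics can be read out of it by label. The sketch below uses a hypothetical frame and the `.loc` accessor, which has not been introduced in this lecture and is used here only for illustration:

```python
import pandas as pd

df_demo = pd.DataFrame({'x': [1, 2, 3, 4], 'label': ['a', 'b', 'a', 'c']})

stats = df_demo.describe()      # only the numeric column 'x' is summarised
print(type(stats).__name__)
print(stats.loc['mean', 'x'])   # row label first, then column label
print(stats.loc['50%', 'x'])    # the median
```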

As one can read in the [docs](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html), the method returns summary statistics of the Series or DataFrame provided.

Again, from the official documentation:

> For numeric data, the result’s index will include `count`, `mean`, `std`, `min`, `max` as well as lower, $50$ and upper percentiles. By default the lower percentile is $25$ and the upper percentile is $75$. The $50$ percentile is the same as the median.
>
> For object data (e.g. strings or timestamps), the result’s index will include `count`, `unique`, `top`, and `freq`. The `top` is the most common value. The `freq` is the most common value’s frequency. `Timestamps` also include the first and last items.
>
> If multiple object values have the highest count, then the `count` and `top` results will be arbitrarily chosen from among those with the highest count.
>
> _For mixed data types provided via a DataFrame, the default is to return only an analysis of numeric columns. If the dataframe consists only of object and categorical data without any numeric columns, the default is to return an analysis of both the object and categorical columns. If include='all' is provided as an option, the result will include a union of attributes of each type._
>
> The include and exclude parameters can be used to limit which columns in a DataFrame are analyzed for the output. The parameters are ignored when analyzing a Series.

This explains why the cell above only returns a summary of the column `user_id`.

Let's try setting the `include='all'` parameter.

In [13]:
df.describe(include='all')

Unnamed: 0,my_datetime,event,country,user_id,source,topic
count,1795,1795,1795,1795.0,1795,1795
unique,1773,1,8,,3,6
top,2018-01-01 03:10:36,read,country_2,,Reddit,Asia
freq,3,1795,462,,949,667
mean,,,,2458152000.0,,
std,,,,518.3162,,
min,,,,2458151000.0,,
25%,,,,2458152000.0,,
50%,,,,2458152000.0,,
75%,,,,2458153000.0,,


On the other hand, we also have the `info` method.

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1795 entries, 0 to 1794
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   my_datetime  1795 non-null   object
 1   event        1795 non-null   object
 2   country      1795 non-null   object
 3   user_id      1795 non-null   int64 
 4   source       1795 non-null   object
 5   topic        1795 non-null   object
dtypes: int64(1), object(5)
memory usage: 84.3+ KB


This method actually returns `None`, but while executing it _prints_ some information about the dataframe on screen. 
It is less informative (almost not informative at all) from the statistical point of view, but it tells us some structural properties of the dataframe: the index `dtype` and columns, the non-null values and the memory usage.
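
Since `info` only prints, its report cannot be stored in a variable. Some of the same facts are available as ordinary attributes and methods, as in this sketch on a hypothetical frame with one missing value:

```python
import pandas as pd

df_demo = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', None, 'z']})

result = df_demo.info()   # prints the report as a side effect
print(result is None)

# Some of the same facts are available as ordinary objects
print(df_demo.shape)      # (rows, columns)
print(df_demo.dtypes)     # dtype of each column
print(df_demo.count())    # non-null values per column
```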

#### Exercise

Read the documentation and then import the data stored in [`Lectures_src/01.Pandas/datasets/vaccines.csv`](https://raw.githubusercontent.com/italia/covid19-opendata-vaccini/master/dati/consegne-vaccini-latest.csv).
1. Find how much memory the dataframe takes.
2. Count the null values in each column.
3. Check if there are parsing issues in column types.
4. Find the highest number of shots.

_Optional_: Still looking into the docs, try to parse the date columns as datetime objects.

### Filter by selecting columns

You will need to select specific columns in a dataframe countless times. 
Pandas offers a very simple syntax for that, actually two (equivalent) ones.

In [17]:
# first way, the square bracket notation
df['country']

0       country_7
1       country_7
2       country_7
3       country_7
4       country_8
          ...    
1790    country_2
1791    country_8
1792    country_6
1793    country_7
1794    country_5
Name: country, Length: 1795, dtype: object

In [18]:
# second way, the dot notation
df.country

0       country_7
1       country_7
2       country_7
3       country_7
4       country_8
          ...    
1790    country_2
1791    country_8
1792    country_6
1793    country_7
1794    country_5
Name: country, Length: 1795, dtype: object

Note how both of the previous syntaxes return a `Series` rather than a `DataFrame`.

If you want a dataframe, you need to slightly change the previous commands as follows:

In [23]:
# Hard way, no one uses that.
pd.DataFrame(df.country)

Unnamed: 0,country
0,country_7
1,country_7
2,country_7
3,country_7
4,country_8
...,...
1790,country_2
1791,country_8
1792,country_6
1793,country_7


In [24]:
# Easy way, it generalises easily to the multi-column case.
df[['country']]

Unnamed: 0,country
0,country_7
1,country_7
2,country_7
3,country_7
4,country_8
...,...
1790,country_2
1791,country_8
1792,country_6
1793,country_7
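
The difference between single and double brackets can be checked with `type`. The sketch below also illustrates a caveat of the dot notation: it only works when the column name is a valid Python identifier and does not clash with an existing dataframe attribute or method. The frame and its column names are made up for this example:

```python
import pandas as pd

# Hypothetical frame: one column name contains a space, another is called "count"
df_demo = pd.DataFrame({'country': ['country_7', 'country_8'],
                        'user id': [1, 2],
                        'count': [10, 20]})

print(type(df_demo['country']).__name__)    # single brackets
print(type(df_demo[['country']]).__name__)  # double brackets

# Dot notation cannot reach 'user id', and df_demo.count is the method
print(df_demo['user id'].tolist())          # brackets are the only option here
print(callable(df_demo.count))              # the method, not the column
print(df_demo['count'].tolist())            # the column
```

For this reason the bracket notation is the safer default when column names are not under your control.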


#### What about multi-column filter?

As the previous cell might suggest, you only need to pass a list of columns.

Note the double brackets `[[]]`: you can consider the outer `[]` as a _filter_ operator, whose argument is the list of columns.
Recall that a `Series` admits only one column, hence the result of this operation can only be a dataframe.

In [25]:
df[['user_id','country']]

Unnamed: 0,user_id,country
0,2458151261,country_7
1,2458151262,country_7
2,2458151263,country_7
3,2458151264,country_7
4,2458151265,country_8
...,...,...
1790,2458153051,country_2
1791,2458153052,country_8
1792,2458153053,country_6
1793,2458153054,country_7


The order of the names determines the order of the columns in the returned dataframe.

In [26]:
df[['country', 'user_id']]

Unnamed: 0,country,user_id
0,country_7,2458151261
1,country_7,2458151262
2,country_7,2458151263
3,country_7,2458151264
4,country_8,2458151265
...,...,...
1790,country_2,2458153051
1791,country_8,2458153052
1792,country_6,2458153053
1793,country_7,2458153054


### Filter rows on values

There is a complementary way of filtering a dataframe: on row values. That is, we can reduce the number of records in the dataframe based on some condition.

Let's use the imported dataframe and suppose, for instance, you want to see the entries corresponding to the users who came from the "SEO" source. In this case you have to filter for the "SEO" value in the "source" column.

In [27]:
df[df.source == 'SEO']

Unnamed: 0,my_datetime,event,country,user_id,source,topic
0,2018-01-01 00:01:01,read,country_7,2458151261,SEO,North America
1,2018-01-01 00:03:20,read,country_7,2458151262,SEO,South America
11,2018-01-01 00:08:57,read,country_7,2458151272,SEO,Australia
15,2018-01-01 00:11:22,read,country_7,2458151276,SEO,North America
16,2018-01-01 00:13:05,read,country_8,2458151277,SEO,North America
...,...,...,...,...,...,...
1772,2018-01-01 23:45:58,read,country_7,2458153033,SEO,South America
1777,2018-01-01 23:49:52,read,country_5,2458153038,SEO,North America
1779,2018-01-01 23:51:25,read,country_4,2458153040,SEO,South America
1784,2018-01-01 23:54:03,read,country_2,2458153045,SEO,North America


In order to better understand the command above, let's focus on how pandas interprets the filtering procedure.

**Step 1**: First, inside the brackets `[]`, it evaluates every line: is the `df.source` column’s value `'SEO'` or not? The results are boolean values (True or False), or better, a `Series` of boolean values. 
Indeed, we have seen that `df.source` is a series; comparing it with a value (through the binary operator `==`) produces a truth-value object of the same type as `df.source`, hence a series.

In [28]:
# Note the dtype attribute
df.source == 'SEO'

0        True
1        True
2       False
3       False
4       False
        ...  
1790    False
1791     True
1792    False
1793    False
1794    False
Name: source, Length: 1795, dtype: bool

**Step 2**: The previous boolean series is what is called a _mask_. If we filter through a mask, the filtered dataframe keeps every row where the mask is `True` and drops any row where it is `False`.

In [29]:
# A less concise, but maybe clearer notation
mask_seo = (df.source == 'SEO') # Boolean series
df[mask_seo] # Masks away the rows corresponding to "False".

Unnamed: 0,my_datetime,event,country,user_id,source,topic
0,2018-01-01 00:01:01,read,country_7,2458151261,SEO,North America
1,2018-01-01 00:03:20,read,country_7,2458151262,SEO,South America
11,2018-01-01 00:08:57,read,country_7,2458151272,SEO,Australia
15,2018-01-01 00:11:22,read,country_7,2458151276,SEO,North America
16,2018-01-01 00:13:05,read,country_8,2458151277,SEO,North America
...,...,...,...,...,...,...
1772,2018-01-01 23:45:58,read,country_7,2458153033,SEO,South America
1777,2018-01-01 23:49:52,read,country_5,2458153038,SEO,North America
1779,2018-01-01 23:51:25,read,country_4,2458153040,SEO,South America
1784,2018-01-01 23:54:03,read,country_2,2458153045,SEO,North America
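
A mask is an ordinary boolean series, so it can be counted and negated before being used as a filter. A minimal sketch on a hypothetical four-row frame:

```python
import pandas as pd

df_demo = pd.DataFrame({'source': ['SEO', 'AdWords', 'SEO', 'Reddit'],
                        'user_id': [1, 2, 3, 4]})

mask = df_demo.source == 'SEO'

print(mask.sum())                        # True counts as 1: number of matching rows
print(df_demo[mask].index.tolist())      # original row labels are kept
print(df_demo[~mask].user_id.tolist())   # ~ negates the mask
```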


You can now combine several conditions into a single boolean mask and apply even complicated filters.

_Example_: We want to filter the dataframe to get all the users coming from a "SEO" source, with topic related to "Asia" and with a timestamp between 23:00 and 23:30.

In [37]:
bool_mask = ((df.source == 'SEO') & (df.topic == 'Asia') & (df.my_datetime >= '2018-01-01 23:00:00') & (df.my_datetime <= '2018-01-01 23:30:00'))
df[bool_mask]

Unnamed: 0,my_datetime,event,country,user_id,source,topic
1736,2018-01-01 23:21:08,read,country_5,2458152997,SEO,Asia
1740,2018-01-01 23:23:20,read,country_7,2458153001,SEO,Asia
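
Two remarks on the filter above. Each condition needs its own parentheses, because `&` and `|` bind more tightly than comparison operators. Also, `my_datetime` holds strings in this dataframe, so the comparison is lexicographic; it agrees with chronological order only because all timestamps share the same `YYYY-MM-DD hh:mm:ss` layout. The sketch below restates the same idea on a hypothetical frame, using the `between` and `isin` methods, which are not covered in the text above:

```python
import pandas as pd

df_demo = pd.DataFrame({
    'my_datetime': ['2018-01-01 22:59:59', '2018-01-01 23:10:00',
                    '2018-01-01 23:20:00', '2018-01-01 23:45:00'],
    'source': ['SEO', 'SEO', 'Reddit', 'SEO'],
    'topic': ['Asia', 'Asia', 'Asia', 'Europe'],
})

# between is inclusive on both ends by default
in_window = df_demo.my_datetime.between('2018-01-01 23:00:00',
                                        '2018-01-01 23:30:00')

# & is "and", | is "or"; each condition needs its own parentheses
seo_asia = (df_demo.source == 'SEO') & (df_demo.topic == 'Asia')
either = seo_asia | in_window

# isin tests membership in a list of allowed values
seo_or_reddit = df_demo.source.isin(['SEO', 'Reddit'])

print(df_demo[seo_asia & in_window])
```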


#### Bonus
There are a lot of interesting ways of selecting columns out of a dataframe. I suggest this [nice post](https://towardsdatascience.com/interesting-ways-to-select-pandas-dataframe-columns-b29b82bbfb33#:~:text=Selecting%20columns%20based%20on%20their,Returns%20a%20pandas%20series.&text=Passing%20a%20list%20in%20the,columns%20at%20the%20same%20time.) in order to look at some non-standard examples.