<img src='./fig/vertical_COMILLAS_COLOR.jpg' style= 'width:70mm'>
<h1 style='font-family: Optima;color:#ecac00'>
Máster en Big Data. Tecnología y Analítica Avanzada (MBD).
</h1>

<h1 style='font-family: Optima;color:#ecac00'>
Fundamentos Matemáticos del Análisis de Datos (FMAD). 2022-2023.
</h1>

<h1 style='font-family: Optima;color:#ecac00'>
01a Python Setup Checks
</h1> 


**To check that your Python setup works, open the `Cell` menu above and click `Run All`.**


## Check your Conda environment and Python version

Look for the active environment in the output of the following command. It should be `fmad`. Also check the Python version you are using.

In [None]:
!conda info

## Check the Basic Libraries for Data Science with Python

### NumPy 

Check the numpy version and make sure that no errors appear in the output of the code cells below.

In [None]:
import numpy as np
print(np.__version__)

**Note about standard imports:**

+ Many of the Python libraries we will use have standard import names. Even though these are not official names, we strongly recommend their use, in order to make your code readable and compatible. For example, as we have just seen, NumPy should always be imported as `np`.

In [None]:
a = np.random.rand(1000)
%timeit a @ a

### Pandas

Similarly for Pandas and the rest of the libraries.

In [None]:
# Standard import for Pandas
import pandas as pd
print(pd.__version__)

### Matplotlib and Seaborn

In [None]:
# Standard import for Matplotlib
import matplotlib as mpl 
print(mpl.__version__)

# But we will more frequently use this import sequence
import matplotlib.pyplot as plt

In [None]:
# Standard import for Seaborn
import seaborn as sns
print(sns.__version__)

### Scikit-learn


In [None]:
# Import Scikit-learn (the short alias sk is just a convenience here)
import sklearn as sk
print(sk.__version__)

## Working with Pandas DataFrame, First Example.

+ For our first example we will use a data set called `titanic`, which is included in the Seaborn library that you should have already installed. The data set contains information about the passengers of the Titanic, such as their age, gender, the class in which they were traveling, whether they survived the sinking of the ship, etc.  

+ We will soon see how to use Python to read data from different sources: csv and Excel files, urls, databases, APIs, etc. But for now we just want to run some tests and get an overview of the data structures we will be working with.  

In [None]:
titanic = sns.load_dataset('titanic')
print(type(titanic))

As you can see, `titanic` is now a Pandas DataFrame. This is the object that we will most frequently meet when working with data tables (alongside NumPy arrays). To see the first rows of the `titanic` data set we can use the `head` method. The optional `n` argument determines the number of rows in the output (the default is `n = 5`). 

In [None]:
print(titanic.head(n = 4))

**Exercise:** try running the above with other values of `n`. Also run it as `print(titanic.head(4))`. Use the *Help* menu to open the *Reference* link and look up the information about the `head` method for Pandas DataFrames. 

**Note:** you may have noticed some `NaN` values in the `deck` column of the table. These are *missing data*. We will see how to deal with missing data later in the course. 
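
As a quick preview, the following minimal sketch counts the missing values in each column. It uses a small hand-made table (the name `toy` and its values are made up for illustration, they are not the real `titanic` data) so that it runs without downloading anything.

```python
import numpy as np
import pandas as pd

# Toy table with a few missing values (hypothetical data, not the titanic set)
toy = pd.DataFrame({
    'age': [22.0, 38.0, np.nan, 35.0],
    'deck': [np.nan, 'C', np.nan, 'C'],
})

# isna() marks every missing cell with True; sum() then counts them per column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

The same two calls should work on `titanic` itself, for example `titanic.isna().sum()`.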

### Basic properties of a DataFrame

How many rows and columns of data are there in this data set? We get the answer with `shape`:

In [None]:
print(titanic.shape)

If you just use

```python
print(titanic)
```

you will get a truncated view of the table, with the shape information at the bottom. 

To see the *column names* we use, quite naturally:

In [None]:
print(titanic.columns)

As we see, the column names are stored in an `Index` object. We will learn more about indices later in the course. 

Sometimes the row names can contain information as well. They are also stored in an index which can be accessed with:

In [None]:
print(titanic.index)

### Accessing the data

+ We can get any element in the data table using brackets, the `iloc` method and row/column pairs. For example:

In [None]:
titanic.iloc[1, 3]

shows that the element of the second row and fourth column in the table is `38.0`. 

**Always keep in mind that Python indexing is zero-based!** Thus index zero corresponds to the first element in an ordered collection.

In a data table the columns usually correspond to variables and the rows correspond to observations. And we should have a very good reason to do otherwise!! In the above example, the fourth column corresponds to the `age` variable of the Titanic passengers. It is often better to refer to variables by their names. We can do it with the `loc` method:

In [None]:
titanic.loc[1, 'age']

In this case the row indices are numeric (in fact, consecutive zero-based integers), so we can use 1 with either `loc` or `iloc` to select a row. 
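
The difference between the two becomes visible when the row labels are not integers. The sketch below uses a toy table with made-up names as row labels (`toy`, `by_label` and `by_position` are illustrative names, not part of the `titanic` data).

```python
import pandas as pd

# Toy table whose rows are labeled with strings (hypothetical data)
toy = pd.DataFrame(
    {'age': [22.0, 38.0, 26.0]},
    index=['anna', 'bert', 'cleo'],
)

by_label = toy.loc['bert', 'age']   # loc selects by row and column labels
by_position = toy.iloc[1, 0]        # iloc selects by zero-based positions
print(by_label, by_position)
```

Both expressions pick the same cell. With this index, `toy.loc[1, 'age']` would raise a `KeyError`, because `1` is not one of the row labels.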

We could also first extract the `age` column:

In [None]:
ttnc_age = titanic['age']
print(ttnc_age)

What kind of object is this column?

In [None]:
print(type(ttnc_age))

As we see, it is a *Pandas Series* object. A DataFrame can be considered as a collection of Series (columns) with a common (row) index.  
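
To make this idea concrete, here is a minimal sketch that builds a small DataFrame out of two Series and then recovers one of them. The data and the names `age`, `sex` and `toy` are invented for the example.

```python
import pandas as pd

# Two Series that share the same default index 0, 1, 2 (hypothetical data)
age = pd.Series([22.0, 38.0, 26.0], name='age')
sex = pd.Series(['male', 'female', 'female'], name='sex')

# Each Series becomes one column of the DataFrame
toy = pd.DataFrame({'age': age, 'sex': sex})

print(type(toy['age']))                     # each column is again a Series
print(toy['age'].index.equals(toy.index))   # and it shares the row index
```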

**Note:** the `age` column can also be accessed as an attribute of the table, with `titanic.age`. Usually we prefer the bracket notation, but this *attribute notation* can be handy when we want to shorten our code. 

### Modifying data

+ The `loc` and `iloc` methods can also be used to modify elements of the table. Let us modify the age of that passenger and check the result with `head`:

In [None]:
titanic.iloc[1, 3] = 19.0

print(titanic.head())

**Exercise:** now use `iloc` to return the age of the passenger to the original `38.0` value and check your work with `head`. 

### Accessing larger portions of the DataFrame (indexing and slicing)


+ If we want to access several rows and/or columns we can use explicit *indexing* (note that the order is relevant):

In [None]:
titanic.loc[[0, 1, 2, 3, 4] , ['sex', 'age', 'survived', 'pclass']]

or we can use *slicing* with the colon `:` to get consecutive rows or columns:

In [None]:
titanic.loc[:4 , 'survived':'age']

**Warning:** contrary to the usual Python (and NumPy) convention, the final values `4` and `'age'` of the slices are included in the output for `loc`.

+ In the above we have used `loc` but you can also use numeric positions with `iloc`. **Note**, however, that the final values of the numeric indexes are excluded for `iloc` output:

In [None]:
titanic.iloc[:4, :3]
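
The three slicing conventions can be compared side by side. This is a minimal sketch on a toy column (the names `values`, `toy` and the `*_slice` variables are made up for the example).

```python
import pandas as pd

values = [10, 20, 30, 40, 50, 60]
toy = pd.DataFrame({'x': values})   # default row index 0, 1, ..., 5

list_slice = values[:4]             # plain Python list: stop value excluded
loc_slice = toy.loc[:4, 'x']        # loc: the label 4 is included
iloc_slice = toy.iloc[:4, 0]        # iloc: position 4 is excluded

print(len(list_slice), len(loc_slice), len(iloc_slice))
```

The `loc` slice has one more element than the other two because it stops at the label `4` inclusive.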

### Condition based filtering

+ One of the most powerful tools in data analysis is the ability to keep only those rows of a data table that meet some condition, usually expressed as a *boolean* condition. For example, we will filter the `titanic` DataFrame, keeping only the rows corresponding to female passengers aged 25 or older (observe the use of the attribute notation we mentioned before).   

In [None]:
ttnc_female_25plus = titanic.loc[(titanic.age >= 25) &  (titanic.sex == 'female')]

print(ttnc_female_25plus.head())
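
It can help to look at the boolean condition on its own before using it as a filter. The sketch below does this on a small toy table (hypothetical data; `toy`, `mask` and `filtered` are illustrative names).

```python
import pandas as pd

# Toy table (hypothetical data, not the titanic set)
toy = pd.DataFrame({
    'age': [22.0, 38.0, 26.0, 35.0],
    'sex': ['male', 'female', 'female', 'male'],
})

# Each comparison gives a boolean Series; & combines them element by element
mask = (toy['age'] >= 25) & (toy['sex'] == 'female')
print(mask.tolist())

# loc keeps only the rows where the mask is True
filtered = toy.loc[mask]
print(filtered)
```

The parentheses around each comparison are needed because `&` binds more tightly than `>=` and `==` in Python.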

## Some final fun for the session

+ Create a subfolder called `data` in the same folder that contains your Jupyter notebooks for this session. Download the [data set in this link](https://raw.githubusercontent.com/mbdfmad/fmad2223/main/data/gapminder.csv) to that folder.

In [None]:
gapminder = pd.read_csv('data/gapminder.csv', index_col = 0)
print(gapminder.head())
gdp_cap = gapminder.iloc[:, -1].tolist()
life_exp = gapminder.iloc[:, -2].tolist()
pop = gapminder.iloc[:, 2].tolist()
cont = gapminder.iloc[:, 3].tolist()
continentColors = {
    'Asia':'red',
    'Europe':'green',
    'Africa':'blue',
    'Americas':'yellow',
    'Oceania':'black'
}

In [None]:
# Keep a copy of the current settings (plain assignment would only create an alias)
rcParams_old = plt.rcParams.copy()

In [None]:
%matplotlib inline

sns.set()

plt.rcParams['figure.figsize'] = [12, 8]
plt.rcParams['figure.dpi'] = 100

col = [continentColors[item] for item in cont]

# Specify c and alpha inside plt.scatter()
plt.scatter(x = gdp_cap, y = life_exp, s = np.array(pop) / 1000000,
            c = col, alpha = 0.4)

# Customizations
plt.xscale('log')
plt.xlabel('GDP per Capita [in USD]')
plt.ylabel('Life Expectancy [in years]')
plt.title('World Development in 2007')
plt.xticks([1000,10000,100000], ['1k','10k','100k'])
plt.grid(True)

# Show the plot
plt.show()
plt.clf()

In [None]:
from ipyleaflet import Map

Map(center=[40.43000779017108, -3.7125999200687105], zoom=17)