# Introduction to Python - Session 2.1
1. Installing and using packages
2. Data wrangling:
    - The numpy package
    - The pandas package

SLIDES [HERE](https://docs.google.com/presentation/d/1IIdFbMlPzOAoTrdfg6pZ6NDUedwvD_Z7RZjmlM-V41A/export/pdf)

## EXERCISE 1 - Introduction to NumPy

In [31]:
import numpy as np

**1. Create an array `a` of random numbers and shape (3,4).**

**2. Add a fifth column to `a` with values 0, 0.5, and 1.**

**3. Find all values that are greater than or equal to 0.5.**

**4. Replace the entire first row with NaN values.**

**5. Multiply `a` by the vector `b = np.array([1, 0, 10])` using matrix multiplication.**

**6. Multiply the same arrays `a` and `b` element-wise.** Note that `b` is broadcast along all rows.

**7. Calculate the sum, the mean, and the median of each row of `a`. Use the corresponding NumPy functions.**
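The operations in this exercise can be sketched on a small hand-written array. This is only an illustration, not the expected solution: the array `x` and its values are hypothetical, and it is deliberately 3x3 so that both the matrix product and the broadcast with `b` are well defined. The NaN-aware functions (`np.nansum`, `np.nanmean`, `np.nanmedian`) are one possible way to summarize rows that contain missing values.

```python
import numpy as np

# Random array of shape (3, 4); the seed is arbitrary
rng = np.random.default_rng(0)
a = rng.random((3, 4))

# Hypothetical 3x2 array used for the rest of the sketch
x = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# Append a new column; column_stack treats the 1-D array as a column
x = np.column_stack([x, np.array([0, 0.5, 1])])  # shape (3, 3)

# A boolean mask selects the entries that satisfy the condition
big = x[x >= 0.5]

b = np.array([1, 0, 10])
matprod = x @ b        # matrix product: (3, 3) @ (3,) -> (3,)
elementwise = x * b    # b is broadcast: every row of x is multiplied by b

# NaN-aware reductions ignore missing values within each row
x[0, 0] = np.nan
row_sum = np.nansum(x, axis=1)
row_mean = np.nanmean(x, axis=1)
row_median = np.nanmedian(x, axis=1)
```

With the plain `np.sum`, `np.mean`, and `np.median`, any row that contains a NaN would return NaN.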

## EXERCISE 2 - Introduction to Pandas

In [2]:
import pandas as pd

**1. Create the following DataFrame `mydf`, with index `John, Jessica, Steve, Rachel` and columns `Age, Height, Sex`.**

```
43 	181 	M
34 	172 	F
22 	189 	M
27 	167 	F
```

**2. What is the shape of `mydf`?**

**3. Calculate the average age and height in `mydf`.**

**4. Add one row to `mydf`: Georges, who is 53 years old, 168 cm tall, and male.**

**5. Change the row names of `mydf` so that the data becomes anonymous.** Use Patient1, Patient2, etc. instead of actual names.

**6. Create the DataFrame `mydf2` that is a subset of `mydf` containing only the female entries.**

**7. Import the data in `more_patients.tsv` in a DataFrame named `moredf`.**

**8. Create a DataFrame `mydf3` by concatenating `mydf` and `moredf`.**

**9. Calculate the number of male and female patients combining the `.groupby` and `.size` methods in `mydf3`.**

**10. Calculate the average age and height by sex combining the `.groupby` and `.mean` methods in `mydf3`.**

**11. Calculate the average age and height by sex using the `.groupby` and `.apply` methods in `mydf3`.**

**12. Standardize age and height by sex combining the `.groupby` and `.apply` methods in `mydf3`.**
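As a rough sketch of the group-wise operations above, the snippet below builds `mydf` from the values given in this exercise and works on it directly, since `more_patients.tsv` is not reproduced here. The helper `zscore` and the use of `pd.concat` to add the row are illustrative choices, not the only way to do it.

```python
import pandas as pd

mydf = pd.DataFrame(
    {"Age": [43, 34, 22, 27],
     "Height": [181, 172, 189, 167],
     "Sex": ["M", "F", "M", "F"]},
    index=["John", "Jessica", "Steve", "Rachel"],
)

# One way to add a row: concatenate a one-row DataFrame
georges = pd.DataFrame({"Age": [53], "Height": [168], "Sex": ["M"]},
                       index=["Georges"])
mydf = pd.concat([mydf, georges])

# Number of rows per group
counts = mydf.groupby("Sex").size()

# Mean of the numeric columns per group
means = mydf.groupby("Sex")[["Age", "Height"]].mean()

def zscore(group):
    """Hypothetical helper: centre and scale each column within a group."""
    return (group - group.mean()) / group.std()

# group_keys=False keeps the original row labels in the result
standardized = (
    mydf.groupby("Sex", group_keys=False)[["Age", "Height"]].apply(zscore)
)
```

Note that `.std()` in pandas uses the sample standard deviation (`ddof=1`) by default.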

## EXERCISE 3 - Analyzing COVID-19 data

Adapted from: https://www.w3resource.com/python-exercises/project/covid-19/index.php

Data Source: https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_daily_reports

**File naming convention**

MM-DD-YYYY.csv in UTC.

**Field description**

- Province/State: China - province name; US/Canada/Australia - city name, state/province name; Others - name of the event (e.g., "Diamond Princess" cruise ship); other countries - blank.
- Country/Region: country/region name conforming to WHO (will be updated).
- Last Update: MM/DD/YYYY HH:mm (24 hour format, in UTC).
- Confirmed: the number of confirmed cases.
- Deaths: the number of deaths.
- Recovered: the number of recovered cases.

**Load the latest update of the dataset.**

In [3]:
covid_data = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/04-20-2022.csv')

**1. Write a Python program to display the first 5 rows of the COVID-19 dataset. Also print the dataset information (`info()`) and check for missing values (`isna()`).**

**2. Write a Python program to get the latest number of confirmed cases, deaths, recovered cases and active cases of COVID-19 by country.** HINT: You can use the `groupby` function.

**3. Write a Python program to get the confirmed, deaths, recovered and active cases of COVID-19 for each Spanish `Province_State`. Use `sort_values` to sort the values. Save the resulting DataFrame as a CSV file.**

**4. Make a bar plot of the deaths in the previous DataFrame.** Pandas includes some very simple plotting functions for DataFrames, which are often convenient. Here, you can use the `DataFrame.plot.bar()` function. For more complicated plots, the Matplotlib package is recommended.

**5. Make a scatter plot of confirmed cases against deaths for all `Province_State` of the previous DataFrame.** Use the `DataFrame.plot.scatter()` function.
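The aggregation and sorting steps can be tried offline on a tiny stand-in table first. The sketch below makes several assumptions: the numbers and the province labels are invented, and the column names `Country_Region`, `Province_State`, `Confirmed`, and `Deaths` are assumed to match the daily report file loaded above.

```python
import pandas as pd

# Toy stand-in for one daily report; all values are invented
toy = pd.DataFrame({
    "Country_Region": ["Spain", "Spain", "Other"],
    "Province_State": ["Province A", "Province B", "Province C"],
    "Confirmed": [40, 100, 70],
    "Deaths": [2, 10, 5],
})

# Country-wise totals: group the rows, then sum the numeric columns
by_country = toy.groupby("Country_Region")[["Confirmed", "Deaths"]].sum()

# Rows for one country, sorted from most to fewest confirmed cases
spain = (
    toy[toy["Country_Region"] == "Spain"]
    .sort_values("Confirmed", ascending=False)
    .set_index("Province_State")[["Confirmed", "Deaths"]]
)

# Without a path, to_csv returns the CSV text; pass a file name to write a file
csv_text = spain.to_csv()
```

Because `Province_State` is the index of `spain`, a call such as `spain["Deaths"].plot.bar()` would label each bar with the province name.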

## EXERCISE 4 - Gene annotation GFF3

[GFF is a standard file format](http://gmod.org/wiki/GFF3) for storing genomic features in a text file. GFF stands for Generic Feature Format. GFF files are plain text, 9 column, tab-delimited files.

The 9 columns of the annotation section are as follows:

- Column 1: "seqid" - The ID of the landmark used to establish the coordinate system for the current feature, a.k.a. chromosome name.
- Column 2: "source" - The algorithm or operating procedure that generated the feature.
- Column 3: "type" - The type of feature.
- Columns 4 & 5: "start" and "end" - The start and end of the feature.
- Column 6: "score" - The score of the feature, a floating point number.
- Column 7: "strand" - The strand of the feature.
- Column 8: "phase" - For features of type "CDS", the phase indicates where the feature begins with reference to the reading frame.
- Column 9: "attributes" - A list of feature attributes in the format tag=value.

**1. Load the data in "GRCh38.gff3", which contains a random subset of features of the human genome. Show the first 5 instances.**

**2. Which types of features are included in the dataset? How many of each? Make a barplot showing these numbers.**

**3. Create a new column "len" that contains the length of each feature.**

**4. Extract the gene name of all instances from the "attributes" column. Include it in a new column.** HINT: You can use `^` in the regular expression.

**5. Microexons are defined as exons that are 27 nucleotides long or shorter. Find all microexons in the dataset.**

**6. Plot a histogram of the length of microexons. Use `plot.hist()`.**
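A minimal sketch of the parsing steps, using three made-up GFF3 lines instead of "GRCh38.gff3". Two assumptions are worth flagging: the gene name is assumed to be stored under a `Name=` tag (the real file may use a different tag), and the length is computed as `end - start + 1` because GFF3 coordinates are 1-based and inclusive.

```python
import io
import pandas as pd

columns = ["seqid", "source", "type", "start", "end",
           "score", "strand", "phase", "attributes"]

# Made-up records; a real file would be passed to read_csv by its path
toy_gff = io.StringIO(
    "##gff-version 3\n"
    "1\ttoy\tgene\t100\t500\t.\t+\t.\tID=gene:G1;Name=ABC1\n"
    "1\ttoy\texon\t100\t120\t.\t+\t.\tParent=transcript:T1;Name=E1\n"
    "1\ttoy\texon\t300\t500\t.\t+\t.\tParent=transcript:T1;Name=E2\n"
)

# comment="#" skips the header lines; the file has no column names
gff = pd.read_csv(toy_gff, sep="\t", comment="#", header=None, names=columns)

# How many features of each type
type_counts = gff["type"].value_counts()

# Feature length (coordinates are 1-based and inclusive)
gff["len"] = gff["end"] - gff["start"] + 1

# [^;]+ matches everything up to the next semicolon
gff["name"] = gff["attributes"].str.extract(r"Name=([^;]+)", expand=False)

# Exons of 27 nucleotides or fewer
microexons = gff[(gff["type"] == "exon") & (gff["len"] <= 27)]
```

Here `^` appears inside the character class `[^;]`, where it means "any character except `;`".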

## EXERCISE 5 - GDP dataset

This analysis is based on World Bank data, specifically the [World Development Indicators](http://databank.worldbank.org/data/reports.aspx?source=world-development-indicators) dataset. The dataset contains many economic development indicators to choose from. For simplicity, we will use four: GDP per capita (US\$), GDP per capita growth (annual %), GDP growth (annual %), and GDP (current US\$).

**1. Load the "GDP_last25years_08182020.csv" dataset. Missing data is written as ".."; interpret it as NaN. Set the index of the DataFrame to "Series Name" and "Country Code" (multi-indexes are allowed in Pandas). Show the first five lines.**

**2. Note that column names are formatted as "XXXX [YRXXXX]". Reformat them to XXXX.**

**3. Print the GDP (current US\$) of Spain.**

**4. Which country has the highest GDP per capita in 2019?**

**5. Make 4 plots: GDP per capita (US\$), GDP per capita growth (annual %), GDP growth (annual %) and GDP (current US\$) over the years. You will need to transpose the data with `T`.**

**6. To investigate whether different countries show the same trend over the years, make a correlation matrix of GDP per capita (current US\$) using `corr()`.**
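The loading and reshaping steps can be sketched on a toy CSV that mimics the layout described above. Everything in `toy_csv` is invented for illustration (country codes `AAA`, `BBB`, `CCC`, the series label, and the numbers); only the column naming pattern and the ".." missing-value marker come from the exercise text.

```python
import io
import pandas as pd

# Invented data in the same layout: two index columns plus one column per year
toy_csv = io.StringIO(
    "Series Name,Country Code,2018 [YR2018],2019 [YR2019]\n"
    "GDP per capita,AAA,10,12\n"
    "GDP per capita,BBB,20,..\n"
    "GDP per capita,CCC,5,30\n"
)

# ".." is read as NaN; the two index columns form a MultiIndex
gdp = pd.read_csv(toy_csv, na_values="..",
                  index_col=["Series Name", "Country Code"])

# "2018 [YR2018]" -> "2018": keep the part before the first space
gdp.columns = gdp.columns.str.split(" ").str[0]

# Selecting on the first index level leaves one row per country
per_capita = gdp.loc["GDP per capita"]

# Country with the largest value in a given year (NaN is skipped)
richest = per_capita["2019"].idxmax()

# Transpose so that years are rows and countries are columns,
# then correlate the country columns with each other
by_year = per_capita.T
corr = by_year.corr()
```

After transposing, `by_year.plot()` would draw one line per country with the years on the x-axis.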

# Introduction to Python - Basic Data Visualization - Session 2.2

1. `seaborn` - quick data visualization
    - Categorical
    - Continuous
    - Categorical vs. continuous
    - Continuous vs. continuous
    - \>2 variables
    - "meta" plotting functions
2. `matplotlib` - full control of figures
    - reproduce a seaborn figure
3. Customize your plots
    - with `seaborn`
    - with `matplotlib`
4. Create and save your plot 

In Python, there are two main packages used to visualize data: [`matplotlib`](https://matplotlib.org/) and [`seaborn`](https://seaborn.pydata.org/).
Although `matplotlib` laid the foundation for data visualization in Python, making quick plots with it can be cumbersome. The `seaborn` library was created as a wrapper around `matplotlib` to make plotting simple and quick.

Both are so widely used that many other packages, such as `pandas`, rely on them for their visualizations.

In this session, we will use the [`palmerpenguins`](https://allisonhorst.github.io/palmerpenguins/articles/intro.html) dataset to work out how to visualize data easily with these libraries, and how to use them to customize every detail of a plot.

In [2]:
# first, load the essential packages to read and wrangle tables
import pandas as pd
import numpy as np

In [5]:
# read the table from the local "data" folder and drop rows with missing values
data = pd.read_csv("data/penguins.csv")
data = data.dropna()
data

|     | Unnamed: 0 | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 |
| 1 | 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 |
| 2 | 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 |
| 4 | 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female | 2007 |
| 5 | 5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | male | 2007 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 339 | 339 | Chinstrap | Dream | 55.8 | 19.8 | 207.0 | 4000.0 | male | 2009 |
| 340 | 340 | Chinstrap | Dream | 43.5 | 18.1 | 202.0 | 3400.0 | female | 2009 |
| 341 | 341 | Chinstrap | Dream | 49.6 | 18.2 | 193.0 | 3775.0 | male | 2009 |
| 342 | 342 | Chinstrap | Dream | 50.8 | 19.0 | 210.0 | 4100.0 | male | 2009 |


This dataset contains different types of information on these penguins, which we can classify as categorical or continuous. Below is a summary of the numeric variables.

In [5]:
data.describe()

|     | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | year |
| --- | --- | --- | --- | --- | --- |
| count | 333.0 | 333.0 | 333.0 | 333.0 | 333.0 |
| mean | 43.992793 | 17.164865 | 200.966967 | 4207.057057 | 2008.042042 |
| std | 5.468668 | 1.969235 | 14.015765 | 805.215802 | 0.812944 |
| min | 32.1 | 13.1 | 172.0 | 2700.0 | 2007.0 |
| 25% | 39.5 | 15.6 | 190.0 | 3550.0 | 2007.0 |
| 50% | 44.5 | 17.3 | 197.0 | 4050.0 | 2008.0 |
| 75% | 48.6 | 18.7 | 213.0 | 4775.0 | 2009.0 |
| max | 59.6 | 21.5 | 231.0 | 6300.0 | 2009.0 |


## `seaborn` - quick data visualization

As mentioned above, `seaborn` was born as a `matplotlib` wrapper to facilitate data visualization from dataframes.

Now, we will explore some of the different functions that we can use to quickly explore the dataset by asking questions.

But first, as usual, we need to load the package:

In [6]:
import seaborn as sns

### Categorical - `countplot` and `barplot`

Categorical variables contain information on how we can group our observations into similar buckets or classes. Bar plots are a great tool for that.

**How many penguins of each species are there?**

**1. Use `sns.countplot` setting the 'data' and 'x' arguments.**

In [1]:
# try using sns.countplot using the parameters 'data' and 'x'
# to introduce the dataframe and the variable to plot counts from, respectively


**2. Save your plot into an object called 'g'.**

**3. Relabel the X and Y axes using the methods `.set_xlabel` and `.set_ylabel`**

Note that we didn't need to provide the counts for each species to the function! That sped up the process! However, for the next visualizations we'll need to be more explicit... So, let's count the classes first.
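For reference, the basic call pattern looks roughly like this. It is a sketch on a toy DataFrame whose `species` values are copied from the rows of the penguins table printed earlier, not on the full dataset; the axis label texts are arbitrary.

```python
import pandas as pd
import seaborn as sns

# Species of the nine penguins listed in the table printed earlier
toy = pd.DataFrame({"species": ["Adelie"] * 5 + ["Chinstrap"] * 4})

# countplot tallies the rows per category by itself and returns the Axes
g = sns.countplot(data=toy, x="species")

# The returned Axes object has the usual matplotlib label setters
g.set_xlabel("Species")
g.set_ylabel("Number of penguins")
```

Keeping the returned object in `g` is what makes the later relabelling possible.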

**4. Switch to `sns.barplot` to visualize the number of penguins of each species.** First, you'll need to create a new DataFrame counting the number of times each penguin species appears. Try to combine the `pandas.DataFrame` methods `.groupby`, `.size`, and `.reset_index`. Now, you'll have to pass both axis variables to the plotting function.

In [2]:
# alternatively, count the penguins for each species and plot


**How many penguins from each island and species are there?**

Note that colors and x-axis are redundant in the plot above. Maybe we can exploit that to visualize more information. The "hue" parameter allows you to further split the bar plot by a selected category. See how grouping our data makes it simpler to ask more complicated questions.
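A possible shape for the count-then-plot workflow is sketched below on a few rows copied from the penguins table printed earlier. Because those rows have a single island per species, the sketch splits by `sex` instead of `island`; the column name `n` for the counts is an arbitrary choice.

```python
import pandas as pd
import seaborn as sns

# Rows copied from the penguins table printed earlier
toy = pd.DataFrame({
    "species": ["Adelie"] * 5 + ["Chinstrap"] * 4,
    "sex": ["male", "female", "female", "female", "male",
            "male", "female", "male", "male"],
})

# Count the rows in every (species, sex) combination;
# reset_index turns the grouped result back into a plain DataFrame
counts = toy.groupby(["species", "sex"]).size().reset_index(name="n")

# barplot needs both axes; hue splits each species bar by sex
g = sns.barplot(data=counts, x="species", y="n", hue="sex")
g.set_ylabel("Number of penguins")
```

The same pattern should work with `["species", "island"]` on the full dataset.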

In [3]:
# try using the 'hue' parameter in sns.barplot


### Continuous - `histplot` and `kdeplot`

Continuous variables give us numerical information on the observations, providing us with a distribution of values for a certain feature, such as penguin bill length.

**What is the distribution of bill lengths?**

In [4]:
# the 'bins' parameter sets the number of bins to partition our continuous variable


**What is the distribution of bill lengths across species?**

### Categorical vs. Continuous - `boxplot`, `violinplot`,`stripplot`,`swarmplot`

We also find a series of plots for those moments when we need to know how continuous variables may differ between groups.

**What are the distributions of bill length between sexes across species?**

Now try using violin plots.

And strip plots, a.k.a. jitter plots.

Or swarm plots.

### Continuous vs. Continuous - `scatterplot`, `kdeplot`, `jointplot`

Finally, sometimes we are interested in how two continuous variables covary.

**What is the relationship between bill length and body mass in penguins of different species?**

With `jointplot` we get three for the price of one!

### >2 variables

**What is the relationship between all continuous variables considering all penguin species?**

#### `pairplot`

#### `heatmap`

#### `clustermap`

### "meta" plotting functions

#### `catplot`

**How many penguins from each island and species of each sex are there?**

In [5]:
# try using the 'hue' and 'col_wrap' parameters in sns.catplot


#### `lineplot`

**What is the relationship between body mass and bill length across species and sexes?**

## `matplotlib` - full control of the plot

In [65]:
import matplotlib.pyplot as plt

### reproduce a `seaborn` figure
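One way to rebuild a colored scatter plot with plain `matplotlib` is to loop over the groups and draw each one separately. This is a sketch on a few rows copied from the penguins table printed earlier; the figure size and label texts are arbitrary.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Rows copied from the penguins table printed earlier
toy = pd.DataFrame({
    "species": ["Adelie"] * 5 + ["Chinstrap"] * 4,
    "bill_length_mm": [39.1, 39.5, 40.3, 36.7, 39.3, 55.8, 43.5, 49.6, 50.8],
    "body_mass_g": [3750.0, 3800.0, 3250.0, 3450.0, 3650.0,
                    4000.0, 3400.0, 3775.0, 4100.0],
})

fig, ax = plt.subplots(figsize=(4, 4))

# matplotlib has no 'hue': draw one scatter call per species,
# each call gets the next color of the default cycle
for species, group in toy.groupby("species"):
    ax.scatter(group["body_mass_g"], group["bill_length_mm"],
               label=species, alpha=0.5)

ax.set_xlabel("Body Mass (g)")
ax.set_ylabel("Bill Length (mm)")
ax.legend(title="species")
```

The loop is roughly what `seaborn` does on our behalf when we pass `hue="species"`.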

## Customize your plots

### `matplotlib`

In [185]:
import matplotlib.font_manager as font_manager

### `seaborn`

## Create and save your plot

In [198]:
help(plt.savefig)

Help on function savefig in module matplotlib.pyplot:

savefig(*args, **kwargs)
    Save the current figure.
    
    Call signature::
    
      savefig(fname, *, dpi='figure', format=None, metadata=None,
              bbox_inches=None, pad_inches=0.1,
              facecolor='auto', edgecolor='auto',
              backend=None, **kwargs
             )
    
    The available output formats depend on the backend being used.
    
    Parameters
    ----------
    fname : str or path-like or binary file-like
        A path, or a Python file-like object, or
        possibly some backend-dependent object such as
        `matplotlib.backends.backend_pdf.PdfPages`.
    
        If *format* is set, it determines the output format, and the file
        is saved as *fname*.  Note that *fname* is used verbatim, and there
        is no attempt to make the extension, if any, of *fname* match
        *format*, and no extension is appended.
    
        If *format* is not set, then the format is inf

In [None]:
# set your figure size
plt.figure(figsize=(4,4))
# place your plot here
g = sns.scatterplot(data=data, x="body_mass_g", y="bill_length_mm", hue="species", alpha=0.5)
g.set_xlabel("Body Mass (g)")
g.set_ylabel("Bill Length (mm)")
# save
plt.savefig("myfigure.png", dpi=300)
# show
plt.show()

![](myfigure.png)