# Introduction to Python - Session 3
1. Installing and using packages
2. Data wrangling:
    - The numpy package
    - The pandas package

SLIDES [HERE](https://docs.google.com/presentation/d/11vBUnU8YTIUY84iuW1UhCj0VYkPXEtmh-rdxV33yh_k/export/pdf)

## EXERCISE 1 - Introduction to NumPy

```python
import numpy as np
```

**1. Create an array `a` of random numbers and shape (3,4).**

**2. Add a fifth column to `a` with values 0, 0.5, and 1.**

**3. Find all values that are greater than or equal to 0.5.**

**4. Replace the entire first row with NaNs.**

**5. Use matrix multiplication against the vector `b = np.array([1, 0, 10])`.**

**6. Compute the element-wise multiplication of the same arrays `a` and `b`.** Note that `b` is broadcast along all rows.

**7. Calculate the sum, the mean, and the median of each row of `a`. Use the corresponding NumPy functions.**
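
The difference between matrix multiplication and broadcast element-wise multiplication can be sketched on a small toy case. The arrays `m`, `v`, and `col` below are hypothetical and chosen only to keep the shapes easy to follow; they are not the `a` and `b` of the exercise. Broadcasting aligns the trailing axes, so a 1-D vector multiplies along the rows only when its length matches the number of columns.

```python
import numpy as np

# Hypothetical toy arrays (not the exercise's `a` and `b`).
m = np.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]], shape (2, 3)
v = np.array([1, 0, 10])         # shape (3,)

# Matrix-vector product: sums over the shared axis of length 3.
prod = m @ v                     # shape (2,)

# Element-wise product: `v` is broadcast across every row of `m`.
elementwise = m * v              # shape (2, 3)

# A vector with one value per row needs an extra axis to broadcast
# down the columns instead.
col = np.array([1, 2])           # shape (2,)
by_rows = m * col[:, np.newaxis] # shape (2, 3)

print(prod)
print(elementwise)
print(by_rows)
```

If the shapes do not line up, NumPy raises a `ValueError` rather than guessing, so checking `.shape` first is usually the quickest way to debug.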

## EXERCISE 2 - Introduction to Pandas

```python
import pandas as pd
```

**1. Create the following DataFrame `mydf`, with index `John, Jessica, Steve, Rachel` and columns `Age, Height, Sex`.**

```
43 	181 	M
34 	172 	F
22 	189 	M
27 	167 	F
```

**2. What is the shape of `mydf`?**

**3. Calculate the average age and height in `mydf`.**

**4. Add one row to `mydf`: Georges, who is 53 years old, 168 cm tall, and male.**

**5. Change the row names of `mydf` so that the data becomes anonymous.** Use Patient1, Patient2, etc. instead of actual names.

**6. Create the DataFrame `mydf2` that is a subset of `mydf` containing only the female entries.**

**7. Import the data in `more_patients.tsv` into a DataFrame named `moredf`.**

**8. Create a DataFrame `mydf3` by concatenating `mydf` and `moredf`.**

**9. Calculate the number of male and female patients combining the `.groupby` and `.size` methods in `mydf3`.**

**10. Calculate the average age and height by sex combining the `.groupby` and `.mean` methods in `mydf3`.**

**11. Calculate the average age and height by sex using the `.groupby` and `.apply` methods in `mydf3`.**

**12. Standardize age and height by sex combining the `.groupby` and `.apply` methods in `mydf3`.**
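
The group-wise operations used in the last few questions can be sketched on a toy DataFrame. The column names `Group` and `Value` and all values are hypothetical. Note one swap: the standardization step below uses `.transform` rather than `.apply`, because `.transform` always returns a result aligned with the original rows; this is an illustration, not the intended solution.

```python
import pandas as pd

# Hypothetical toy data (not the patient table from the exercise).
toy = pd.DataFrame(
    {"Group": ["x", "x", "y", "y"], "Value": [1.0, 3.0, 10.0, 20.0]}
)

grouped = toy.groupby("Group")["Value"]

# Number of rows per group.
sizes = toy.groupby("Group").size()

# Mean per group, first with the built-in method, then with a custom function.
means = grouped.mean()
means_apply = grouped.apply(lambda s: s.sum() / len(s))

# Z-score within each group (pandas uses the sample standard deviation, ddof=1).
zscores = grouped.transform(lambda s: (s - s.mean()) / s.std())

print(sizes)
print(means)
print(zscores)
```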

## EXERCISE 3 - Analyzing COVID-19 data

Adapted from: https://www.w3resource.com/python-exercises/project/covid-19/index.php

Data Source: https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_daily_reports

**File naming convention**

MM-DD-YYYY.csv in UTC.

**Field description**

- Province/State: China - province name; US/Canada/Australia - city name, state/province name; Others - name of the event (e.g., "Diamond Princess" cruise ship); other countries - blank.
- Country/Region: country/region name conforming to WHO (will be updated).
- Last Update: MM/DD/YYYY HH:mm (24 hour format, in UTC).
- Confirmed: the number of confirmed cases.
- Deaths: the number of deaths.
- Recovered: the number of recovered cases.

**Upload the latest update of the dataset.**

**1. Write a Python program to display the first 5 rows of the COVID-19 dataset. Also print the dataset information (`info()`) and check for missing values (`isna()`).**

**2. Write a Python program to get the latest number of confirmed, deaths, recovered and active cases of COVID-19 country-wise.** HINT: You can use the `groupby` method.

**3. Write a Python program to get the confirmed, deaths, recovered and active cases of COVID-19 for each Spanish `Province_State`. Use `sort_values` to sort the values. Save the resulting DataFrame as a CSV file.**

**4. Make a bar plot of the deaths of the previous DataFrame.** Pandas includes some very simple plotting functions for DataFrames, which can often be very convenient. Here, you can use the `DataFrame.plot.bar()` function. For more complicated plots, the package Matplotlib is recommended.

**5. Make a scatter plot of confirmed cases against deaths for all `Province_State` of the previous DataFrame.** Use the `DataFrame.plot.scatter()` function.
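
As a rough sketch of the `groupby` hint, the toy table below aggregates counts per country. The values are made up, the column name `Country_Region` is assumed to follow the same underscore convention as `Province_State`, and active cases are assumed to be confirmed minus deaths minus recovered; check these against the file you actually load.

```python
import pandas as pd

# Hypothetical toy data; the real daily report has more columns and rows.
toy = pd.DataFrame(
    {
        "Country_Region": ["A", "A", "B"],
        "Confirmed": [10, 5, 7],
        "Deaths": [1, 0, 2],
        "Recovered": [4, 2, 3],
    }
)

# Assumption: active = confirmed - deaths - recovered.
toy["Active"] = toy["Confirmed"] - toy["Deaths"] - toy["Recovered"]

# Sum every count column within each country, then sort by confirmed cases.
count_cols = ["Confirmed", "Deaths", "Recovered", "Active"]
totals = toy.groupby("Country_Region")[count_cols].sum()
totals = totals.sort_values("Confirmed", ascending=False)

print(totals)
```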

## EXERCISE 4 - Gene annotation GFF3

[GFF is a standard file format](http://gmod.org/wiki/GFF3) for storing genomic features in a text file. GFF stands for Generic Feature Format. GFF files are plain text, 9 column, tab-delimited files.

The 9 columns of the annotation section are as follows:

- Column 1: "seqid" - The ID of the landmark used to establish the coordinate system for the current feature, a.k.a. chromosome name.
- Column 2: "source" - The algorithm or operating procedure that generated the feature.
- Column 3: "type" - The type of feature.
- Columns 4 & 5: "start" and "end" - The start and end of the feature.
- Column 6: "score" - The score of the feature, a floating point number.
- Column 7: "strand" - The strand of the feature.
- Column 8: "phase" - For features of type "CDS", the phase indicates where the feature begins with reference to the reading frame.
- Column 9: "attributes" - A list of feature attributes in the format tag=value.
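
Because GFF3 files have no header row, the nine column names listed above have to be supplied when loading. The sketch below reads two made-up feature lines from an in-memory string; the feature values are hypothetical. It assumes that lines starting with `#` are directives or comments and that no data field contains a `#`.

```python
import io

import pandas as pd

GFF3_COLUMNS = [
    "seqid", "source", "type", "start", "end",
    "score", "strand", "phase", "attributes",
]

# Hypothetical sample content standing in for a real GFF3 file.
sample = (
    "##gff-version 3\n"
    "chr1\texample\tgene\t100\t200\t.\t+\t.\tID=gene1;Name=ABC1\n"
    "chr1\texample\texon\t100\t120\t.\t+\t.\tID=exon1;Parent=gene1\n"
)

gff = pd.read_csv(
    io.StringIO(sample),   # replace with a file path for a real file
    sep="\t",
    comment="#",           # skip directive and comment lines
    header=None,
    names=GFF3_COLUMNS,
    na_values=".",         # GFF3 writes undefined fields as "."
)

print(gff.head())
```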

**1. Load the data in "GRCh38.gff3", which contains a random subset of features of the human genome. Show the first 5 instances.**

**2. Which types of features are included in the dataset? How many of each? Make a barplot showing these numbers.**

**3. Create a new column "len" that contains the length of each feature.**

**4. Extract the gene name of all instances from the "attributes" column. Include it in a new column.** HINT: You can use `^` in the regular expression.

**5. Microexons are defined as exons of 27 nucleotides or shorter. Find all microexons in the dataset.**
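
One way the `^` hint can be used is inside a character class, where `[^;]` matches any character except a semicolon. The sketch below applies this to two made-up attribute strings; the tag `Name` is an assumption, and the real file may store gene names under a different tag.

```python
import pandas as pd

# Hypothetical attribute strings in tag=value format.
attrs = pd.Series(
    [
        "ID=gene1;Name=ABC1;biotype=protein_coding",
        "ID=exon1;Parent=gene1",
    ]
)

# Capture everything after "Name=" up to the next semicolon (or end of string).
# Rows without a matching tag get NaN.
names = attrs.str.extract(r"Name=([^;]+)", expand=False)

print(names)
```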

**6. Plot a histogram of the length of microexons. Use `plot.hist()`.**

## EXERCISE 5 - GDP dataset

This analysis is based on World Bank data, specifically the [World Development Indicators](http://databank.worldbank.org/data/reports.aspx?source=world-development-indicators) dataset. It contains many different economic development indicators to choose from. For simplicity, we will use: GDP per capita (US\\$), GDP per capita growth (annual \%), GDP growth (annual \%), GDP (current US\\$).

**1. Load the "GDP_last25years_08182020.csv" dataset. Missing data is written as ".."; interpret it as NaN. Set the index of the DataFrame to "Series Name" and "Country Code" (multi-indexes are allowed in Pandas). Show the first five rows.**

**2. Note that column names are formatted as "XXXX [YRXXXX]". Reformat them to XXXX.**
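
Both the custom missing-value marker and the two-level index can be handled directly in `pd.read_csv`. The sketch below uses a tiny in-memory table with made-up indicator names, country codes, and values; only the two index column names and the ".." marker come from the exercise.

```python
import io

import pandas as pd

# Hypothetical sample content standing in for the real CSV file.
sample = (
    "Series Name,Country Code,2018 [YR2018],2019 [YR2019]\n"
    "Indicator A,AAA,1.5,..\n"
    "Indicator A,BBB,..,2.5\n"
)

df = pd.read_csv(
    io.StringIO(sample),                         # replace with the file path
    na_values="..",                              # treat ".." as NaN
    index_col=["Series Name", "Country Code"],   # builds a MultiIndex
)

print(df.head())
# Rows are then addressed with a tuple, one entry per index level.
print(df.loc[("Indicator A", "AAA")])
```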
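
Column labels can be replaced wholesale by assigning a new list to `DataFrame.columns`. The sketch below does this on a hypothetical two-column frame, keeping only the part of each label before the first space; a regular expression would work equally well.

```python
import pandas as pd

# Hypothetical frame with labels in the "XXXX [YRXXXX]" style.
frame = pd.DataFrame({"2018 [YR2018]": [1.0], "2019 [YR2019]": [2.0]})

# Keep only the text before the first space in each label.
frame.columns = [label.split(" ")[0] for label in frame.columns]

print(list(frame.columns))
```

The labels remain strings after this step; converting them with `int` is a separate choice that depends on how the years will be used later.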

**3. Print the GDP (current US\\$) of Spain.**

**4. Which country had the highest GDP per capita in 2019?**

**5. Make 4 plots: GDP per capita (US\\$), GDP per capita growth (annual \%), GDP growth (annual \%) and GDP (current US\\$) over the years. You will need to transpose the data with `T`.**

**6. To investigate whether different countries show the same trend over the years, make a correlation matrix of GDP per capita (US\\$) using `corr()`.**