[![Open in colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/nunososorio/bhs/edit/main/Data_structures/NB3_Hands_on_students.ipynb)



# Notebook 3 - A hands-on example with [AnnData](https://anndata.readthedocs.io/en/latest/) and [Scanpy](https://scanpy.readthedocs.io/en/stable/)

In this notebook we will use data from an everyday example to populate an AnnData object and analyze it with Scanpy.

<br/><br/>

We will use [IPMA](https://www.ipma.pt/pt/index.html) weather data from different cities in Portugal to see what the main differences between them are and how they group together.
<br/><br/>
The data was parsed from Wikipedia and is stored in three convenient CSV files:
- **Pt_cities.csv**: region, population and area of each city
- **Pt_temp.csv**: temperature data
- **Pt_rain.csv**: rainfall data

<img src="https://www.ipma.pt/opencms/system/modules/ipma.website/resources/images/logo-ipma-17.svg" alt="IPMA logo" style="width:600px; height:auto;"/>


# Set up the environment

The *basic* libraries (Numpy, Pandas, Matplotlib, Seaborn...) are already installed in Google Colab. To run this notebook you will also need scanpy and anndata; installing scanpy with pip pulls in anndata as a dependency.

In [None]:
import sys
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    !pip install scanpy

In [None]:
# Import all the libraries we will use
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scanpy as sc

In [None]:
# Some details for the plots
plt.rcParams.update({'font.size':18, 'figure.figsize':(8,8)})

In [None]:
if not os.path.exists('Pt_cities.csv'):
    !wget https://raw.githubusercontent.com/Leo-GG/bhs/main/Data_structures/Weather/Pt_cities.csv
if not os.path.exists('Pt_temp.csv'):
    !wget https://raw.githubusercontent.com/Leo-GG/bhs/main/Data_structures/Weather/Pt_temp.csv
if not os.path.exists('Pt_rain.csv'):
    !wget https://raw.githubusercontent.com/Leo-GG/bhs/main/Data_structures/Weather/Pt_rain.csv

# Load the data
Create a DataFrame from each of the files
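
As a minimal sketch of what `index_col=0` does, the snippet below reads a small in-memory CSV instead of a file. The city names and numbers are made up for illustration and are not taken from the course files.

```python
import io
import pandas as pd

# Hypothetical CSV content, NOT the real course data
csv_text = """City,Jan,Feb
Alpha,10.5,11.0
Beta,8.0,9.5
"""

# index_col=0 turns the first column into the row labels (the index)
df_demo = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(df_demo.index.tolist())    # row labels come from the first column
print(df_demo.columns.tolist())  # the remaining columns hold the values
```

Using the city name as the index makes it easy to look up a row by label later, for example with `df_demo.loc["Beta"]`.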

In [None]:
# Cities "metadata"
df_cities=pd.read_csv('Pt_cities.csv', index_col=0)  # Specify that we want the first column to be used as index

# Temperature values
df_temp=pd.read_csv('Pt_temp.csv', index_col=0)

# Rainfall data
df_rain= WRITE_YOUR_CODE_HERE


In [None]:
df_cities.head(3)

In [None]:
df_temp.head(3)

In [None]:
df_rain.head(3)

# Basic questions using DataFrames
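
The two operations needed in this section, sorting rows by a column and averaging a column, can be sketched on a toy table. The column names and values here are hypothetical and will not match the real files.

```python
import pandas as pd

# Hypothetical toy table, NOT the real course data
df_demo = pd.DataFrame(
    {"Population": [500, 1200, 300], "Rain_Jan": [80.0, 100.0, 60.0]},
    index=["Alpha", "Beta", "Gamma"],
)

# Sort rows by one column; ascending=False puts the largest value first
by_pop = df_demo.sort_values("Population", ascending=False)

# The index keeps the row labels, so index[1] is the label of the 2nd row after sorting
second_largest = by_pop.index[1]

# Selecting a single column gives a Series; mean() averages over all rows
mean_rain = df_demo["Rain_Jan"].mean()

print(second_largest, mean_rain)
```

By default `sort_values()` sorts in ascending order, which is convenient for "smallest" questions; pass `ascending=False` for "largest" ones.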

In [None]:
# Which city is the 3rd largest by population?

# You can use the sort_values() function to quickly view the df_cities DataFrame ordered by the values of any column
df_cities.sort_values('WRITE_YOUR_CODE_HERE')

In [None]:
# Which city has the 4th smallest area?

WRITE_YOUR_CODE_HERE

In [None]:
# What was the average total rainfall in January across all cities?

# You can see the column names so you know what you are looking for...
df_rain.columns

In [None]:
df_rain[ WRITE_YOUR_CODE_HERE ].WRITE_YOUR_CODE_HERE()

# Mixing the data
Temperature and rainfall are two different "modalities", measured on different scales. If we want to analyze them together, we should at least scale the values so that everything is in the same range.
<br/><br/>


<img src="https://github.com/Leo-GG/bhs/blob/main/Data_structures/Illustrations/apploranges.jpg?raw=true" alt="AnnData" style="width:600px; height:auto;"/>

<br/><br/>
*or we could use multi-omics if we had more time...
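
To make the idea of bringing both modalities into the same range concrete, here is a minimal sketch of min-max scaling on made-up values. It computes the formula by hand and then with scikit-learn's `MinMaxScaler`; the table is hypothetical and only for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy values on very different scales, NOT the real course data
df_demo = pd.DataFrame(
    {"Temp_C": [10.0, 15.0, 20.0], "Rain_mm": [0.0, 50.0, 200.0]},
    index=["Alpha", "Beta", "Gamma"],
)

# Manual min-max scaling, column by column: (x - min) / (max - min)
manual = (df_demo - df_demo.min()) / (df_demo.max() - df_demo.min())

# Same result with scikit-learn; fit_transform returns a NumPy array,
# so we wrap it back into a DataFrame to keep the row and column labels
scaled = pd.DataFrame(
    MinMaxScaler().fit_transform(df_demo),
    columns=df_demo.columns,
    index=df_demo.index,
)

print(scaled)
```

Each column is scaled independently, so after scaling both columns run from 0 to 1 regardless of their original units.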

In [None]:
# Import a scaler from sklearn. This scaler will "fit" and "transform" our data to the interval [0, 1], where 0 is the minimum value and 1 is the maximum of each column
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

# First the temperature
scaled_temp = scaler.fit_transform(df_temp)
scaled_temp = pd.DataFrame(data= scaled_temp , columns=df_temp.columns, index=df_temp.index)

# Then the rain
scaled_rain = pd.DataFrame(data = scaler.fit_transform(df_rain), columns=df_rain.columns, index=df_rain.index)


In [None]:
scaled_rain.head(3)

In [None]:
# Now put everything together

# We can do this with the merge() function from pandas. This will join DataFrames using the index or any column that we specify
df_w=scaled_temp.merge(scaled_rain, left_index=True, right_index=True)

# Merging DataFrames is extremely useful!!

In [None]:
df_w.shape

In [None]:
df_w.head(3)
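
An index-based merge like the one above can be sketched on two tiny tables. The labels and values are hypothetical; the point is only that rows are matched by index label, not by position.

```python
import pandas as pd

# Hypothetical toy tables sharing the same row labels, NOT the real course data
left = pd.DataFrame({"Temp_Jan": [0.2, 0.9]}, index=["Alpha", "Beta"])
right = pd.DataFrame({"Rain_Jan": [0.7, 0.1]}, index=["Beta", "Alpha"])

# left_index/right_index tell merge() to match rows by their index labels,
# so the different row order in `right` does not matter
merged = left.merge(right, left_index=True, right_index=True)

print(merged)
```

If both tables contained a column with the same name, pandas would keep both and append the suffixes `_x` and `_y` by default, so it helps to have distinct column names before merging.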

# Make an AnnData object

Now we have
- df_w: A DataFrame with numerical values
- df_cities: A DataFrame with "metadata" about the cities

Can you put them together into an AnnData object?

In [None]:
# Create the AnnData object
my_adata=sc.AnnData(X= WRITE_YOUR_CODE_HERE , obs= WRITE_YOUR_CODE_HERE)

In [None]:
# Check its dimensions and variables
my_adata

In [None]:
# Check the variable names
my_adata.var_names

In [None]:
# Let's see the hottest places in Summer (Jun, Jul, Aug)

sc.pl.heatmap(adata = my_adata, var_names=['Daily_mean_Jun','Daily_mean_Jul','Daily_mean_Aug'], groupby='City_name', swap_axes=True,figsize=[14,6])

In [None]:
# Can you make a plot to see the coldest REGION in the winter months (Dec, Jan, Feb)?

sc.pl.heatmap( adata = WRITE_YOUR_CODE_HERE, var_names = [ 'WRITE_YOUR_CODE_HERE','WRITE_YOUR_CODE_HERE', ...] , groupby = 'WRITE_YOUR_CODE_HERE', swap_axes=True,figsize=[14,6] )


# How is the weather?
Now run a statistical test to see the largest differences between the regions.
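
To build some intuition for a one-vs-rest rank test, the sketch below compares each group against all remaining observations using `scipy.stats.ranksums` on made-up values. This is only a conceptual illustration under the assumption of a single feature; it is not Scanpy's implementation, and the data are hypothetical.

```python
import pandas as pd
from scipy.stats import ranksums

# Hypothetical toy data, NOT the real course data
df_demo = pd.DataFrame(
    {
        "Region": ["North"] * 4 + ["South"] * 4,
        "Temp_Jul": [0.1, 0.2, 0.15, 0.25, 0.8, 0.9, 0.85, 0.95],
    }
)

# One-vs-rest: compare each group against all other observations
results = {}
for region in df_demo["Region"].unique():
    in_group = df_demo.loc[df_demo["Region"] == region, "Temp_Jul"]
    rest = df_demo.loc[df_demo["Region"] != region, "Temp_Jul"]
    stat, pval = ranksums(in_group, rest)
    results[region] = (stat, pval)

# A positive statistic means the group tends to have larger values than the rest
print(results)
```

With only two groups, "the rest" is simply the other group, so the two statistics have opposite signs.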

In [None]:
# Run a comparison between different regions using a Wilcoxon test
sc.tl.rank_genes_groups(adata= WRITE_YOUR_CODE_HERE, groupby= 'WRITE_YOUR_CODE_HERE', method='wilcoxon')

In [None]:
my_adata

In [None]:
# Now let's see the results; plot the first three features that are most distinctive of each region
# Note that the comparison was done between each region and ALL the others!
sc.pl.rank_genes_groups_heatmap(adata =  WRITE_YOUR_CODE_HERE, n_genes = WRITE_YOUR_CODE_HERE, groupby = 'WRITE_YOUR_CODE_HERE', swap_axes=True,figsize=[10,12])

# Plotting data in 2d
Now use Scanpy to run PCA on the data
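
As a rough sketch of what PCA computes, the snippet below centers a small random matrix, decomposes it with NumPy's SVD and derives the fraction of variance explained by each component. The data are random and hypothetical, and Scanpy's own solver and defaults may differ.

```python
import numpy as np

# Hypothetical random data (6 observations, 4 features), NOT the real course data
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))

# Step 1: center each feature at zero
X_centered = X - X.mean(axis=0)

# Step 2: singular value decomposition; the rows of Vt are the principal axes
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Fraction of the total variance explained by each component
variance_ratio = S**2 / np.sum(S**2)

# Coordinates of each observation on the first two components
X_pcs = X_centered @ Vt[:2].T

print(variance_ratio)
print(X_pcs.shape)
```

The components are ordered by explained variance, which is why plotting the first two usually captures the main structure of the data.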

In [None]:
# Apply PCA, use just 10 components
# The object is passed positionally because the first parameter of sc.pp.pca is named `data`
sc.pp.pca(my_adata, n_comps=10)


In [None]:
# Visualize the fraction of variance explained by each principal component
sc.pl.pca_variance_ratio(adata = my_adata, log=True, n_pcs=10)

Plot the data projected on the first two PCs (the default option). Color by region, city name and some of the distinctive features.

In [None]:
# Visualize the data projected on the PCs, color by region and city
sc.pl.pca(adata = my_adata, color=[ 'WRITE_YOUR_CODE_HERE' , 'WRITE_YOUR_CODE_HERE'], wspace=0.4)


In [None]:
# Make another plot coloring by some of the distinctive features of each region
sc.pl.pca(adata = my_adata, color=['WRITE_YOUR_CODE_HERE', 'WRITE_YOUR_CODE_HERE', 'WRITE_YOUR_CODE_HERE', 'WRITE_YOUR_CODE_HERE'])