# Introduction to Data Analysis with Python
 Basics (adapted from http://swcarpentry.github.io/python-novice-inflammation/)

This Jupyter notebook contains an introduction to Python.

Authors:
Demi Vasques
(Jaap Geraerts)

#### 1. Storing data in Python

Besides variables, there are other ways of storing data in Python. The two most common are:

* lists
* dictionaries

A list stores multiple values in order (but only values!), whereas a dictionary stores values that are each associated with a key. Let's see an example:

In [None]:
# below we have two lists: one with historical actors and the other with their birth dates
historical_actors = ['Isabella I of Castile','Napoleon Bonaparte','Catherine the Great','Martin Luther','Queen Victoria']
birth_dates = ['04/22/1451','08/15/1769','05/02/1729','11/10/1483','05/24/1819']

In [None]:
# we can create a single dictionary with these two lists, with historical actors as keys and
# their respective birth dates as values
birth_actors = {'Isabella I of Castile':'04/22/1451','Napoleon Bonaparte':'08/15/1769','Catherine the Great':'05/02/1729',
                'Martin Luther':'11/10/1483','Queen Victoria':'05/24/1819'}
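
As a minimal sketch of how the two lists and the dictionary relate, the cell below builds a dictionary from two lists with `zip` and then looks a value up by its key. The short sample lists and the names `actors`, `dates` and `birth_by_actor` are assumptions made for illustration, not part of the original notebook.

```python
# hypothetical two-item sample, so the cell runs on its own
actors = ['Martin Luther', 'Queen Victoria']
dates = ['11/10/1483', '05/24/1819']

# zip pairs the items position by position; dict turns each pair into a key/value entry
birth_by_actor = dict(zip(actors, dates))

# look up a value by its key
print(birth_by_actor['Martin Luther'])

# .get returns a fallback value instead of raising KeyError when the key is missing
print(birth_by_actor.get('Joan of Arc', 'unknown'))
```

Building the dictionary from the lists avoids typing every name and date a second time, which is one way to keep the two representations consistent.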

#### 2. Manipulating data

Two very basic and also very practical ways of manipulating data are:

* slicing
* for loops

##### Slicing

**Very, very important** - In Python the index (the position of a value) starts at 0, so when slicing we have to keep this in mind! The first value of a list, for instance, has index 0!

In [None]:
# the entire list (our dataset)
print(historical_actors)

In [None]:
# one may be interested only in the first three values of the dataset
print(historical_actors[0:3]) # the first limit (the value before the ':') is included, but the second limit is not!
print(historical_actors[:3]) 

In [None]:
# or perhaps, in the last four values of the data
print(historical_actors[-4:])

In [None]:
# or only in the values at specific positions
print(historical_actors[1:3])
print(historical_actors[2:4])
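
The indexing rules above can be checked on a tiny list where the positions are easy to count. This is only a sketch; the list `letters` is a made-up sample, and the last line uses a step value, which the cells above do not cover.

```python
# hypothetical sample list, so the cell runs on its own
letters = ['a', 'b', 'c', 'd', 'e']

print(letters[0])     # the first value sits at index 0
print(letters[-1])    # negative indexes count from the end of the list
print(letters[1:4])   # indexes 1, 2 and 3; index 4 is excluded
print(letters[::2])   # a third number is the step: every second value
```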

##### For loops

A for loop is used when we want to repeat the same task several times, for instance once for every value of the dataset. There are two main ways of writing for loops:

In [None]:
# first, we can 'call' values directly
for actor in historical_actors:
    print(actor)
    
for date in birth_dates:
    print(date)

In [None]:
# second, we can 'call' values using their indexes 
# this is particularly useful when we have more than one list, for example

for i in range(len(historical_actors)): 
    print('The birth of', historical_actors[i], 'was on', birth_dates[i])
    
# this reads as: for every index in the range of the length of the list containing the historical actors,
# print their name and their birthdate
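
A possible alternative to looping over indexes, sketched below, is to let `zip` walk through two lists in parallel and `enumerate` supply the position when it is needed. The sample lists and the name `sentences` are assumptions for illustration only.

```python
# hypothetical two-item sample, so the cell runs on its own
actors = ['Martin Luther', 'Queen Victoria']
dates = ['11/10/1483', '05/24/1819']

# zip yields one (actor, date) pair per position, so no index is needed
sentences = []
for actor, date in zip(actors, dates):
    sentences.append('The birth of ' + actor + ' was on ' + date)

# enumerate yields the position together with the value
for position, sentence in enumerate(sentences):
    print(position, sentence)
```

Note that `zip` stops at the end of the shorter list, so it is best suited to lists that are known to have the same length.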

#### 3. Loading data

As with everything so far, there are many ways of importing (or loading) data into Python. Here we are going to use:

* the **pandas** package (https://pandas.pydata.org/) for data analysis
* **CSV** files (https://en.wikipedia.org/wiki/Comma-separated_values) for storing tabular data

In [None]:
import pandas as pd # importing a package for data analysis

# read the CSV file once into a table (a pandas DataFrame)
actors_table = pd.read_csv('actors_birthdate.csv')

# let's load the historical actors into a Python list
historical_actors = actors_table['historical_actors'].values.tolist()

# and load the birth dates as well
birth_dates = actors_table['birth_dates'].values.tolist()

In [None]:
print(historical_actors)
print(birth_dates)
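
To try the loading step without the CSV file at hand, the sketch below reads the same kind of table from a string held in memory. The content of `csv_text` is a made-up two-row sample and the names `table`, `actors` and `dates` are illustrative; only the column names mirror the ones used above.

```python
import io
import pandas as pd

# hypothetical stand-in for a CSV file on disk
csv_text = """historical_actors,birth_dates
Martin Luther,11/10/1483
Queen Victoria,05/24/1819
"""

# read_csv accepts a file path or a file-like object such as StringIO
table = pd.read_csv(io.StringIO(csv_text))

# each column can be turned into a plain Python list
actors = table['historical_actors'].tolist()
dates = table['birth_dates'].tolist()

print(table.shape)   # (number of rows, number of columns)
print(actors)
print(dates)
```

By default the dates stay plain strings; pandas only converts them to date objects when asked to.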

#### 4. Example: creating a network
Now let's create a network of letters exchanged by historical actors, supposing that they could send letters to anyone in history, past or future :)

In [None]:
# first, we import the packages we believe we might need
import networkx as nx
import pandas as pd

# then, we define a function to create the network from our data
# a function itself has some steps within it
def create_network_of_letters(filename):
    
    # loading the data (the file is read once, then split into columns)
    data = pd.read_csv(filename)
    senders = data['senders'].values.tolist()
    nations = data['nationality'].values.tolist()
    recipients = data['recipients'].values.tolist()
    
    # initiating a graph (network)
    G = nx.Graph()
    
    # adding nodes and links to the network
    for i in range(len(senders)):
        G.add_node(senders[i],birthplace=nations[i],name=senders[i])
        G.add_node(recipients[i])
        G.add_edge(senders[i],recipients[i])
        
    # return the final network    
    return G

# now we call the function to create the network with our data
# we need to give as input the name of the file where our data is stored
G = create_network_of_letters('letters_exchange.csv')

# after the network is created, we can check the degree of each node,
# that is, how many other actors each actor has corresponded with
print(G.degree())
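
To make the notion of degree concrete, here is a sketch that counts correspondents by hand, using only plain Python. The list `letters` is a made-up sample without self-addressed letters, and the names `correspondents` and `degree` are illustrative. For such data the counts should agree with the degrees of an undirected `nx.Graph`, where repeated letters between the same two actors collapse into a single link.

```python
# hypothetical (sender, recipient) pairs; nobody writes to themselves here
letters = [
    ('Martin Luther', 'Queen Victoria'),
    ('Martin Luther', 'Napoleon Bonaparte'),
    ('Queen Victoria', 'Martin Luther'),
]

# for every actor, collect the set of distinct actors they exchanged letters with
correspondents = {}
for sender, recipient in letters:
    correspondents.setdefault(sender, set()).add(recipient)
    correspondents.setdefault(recipient, set()).add(sender)

# the degree of an actor is the size of that set
degree = {actor: len(others) for actor, others in correspondents.items()}
print(degree)
```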

#### 5. Example: visualisation

Again, there are myriad possibilities for data visualisation. We will now see a simple script to visualise the network we just created: the letters exchanged between historical actors of different periods.

In [None]:
# first, again, we import the packages that might be useful
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(color_codes=True)
%matplotlib inline

# we set up the visualisation environment
fig = plt.figure(figsize=(8,6), dpi=500)
ax = fig.add_subplot(1,1,1)

# we calculate the position of the nodes in the layout
pos = nx.fruchterman_reingold_layout(G)

# we define the colors of the nodes based on their nationalities
colors = {'Georgia': '#1a1a1a', 'Austria': '#EF4A42', 'US': '#9E9E9E', \
          'Italy': '#098137', 'France': '#501557', 'Greece': '#770808', \
          'Germany': '#d82a20', 'Unknown': '#FDE401', 'UK':'#00529F'}

# we draw the network
# note: G.nodes replaces G.node, which was removed in networkx 2.4
nx.draw_networkx_nodes(G, pos=pos, node_size=[d**3 for n,d in G.degree()], alpha=0.9, \
                       node_color=[colors[G.nodes[node]['birthplace']] for node in G])
nx.draw_networkx_labels(G, pos=pos, labels=nx.get_node_attributes(G, 'name'), font_size=9)
nx.draw_networkx_edges(G, pos=pos, width=1,alpha=0.3,edge_color='b')

# we show the network on the screen
plt.tight_layout()
plt.axis('off')
plt.show()

In [None]:
# auxiliary script to create the data of letters exchanged
# (it writes 'letters_exchange.csv', the file loaded in the network example)
import pandas as pd
import csv
import random
import numpy as np

senders = pd.read_csv('actors_birthplace.csv')['Senders'].values.tolist()
places = pd.read_csv('actors_birthplace.csv')['Nationality'].values.tolist()

send = []
nat = []
recip = []

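for i in range(len(senders)):
    # number of letters sent by this actor, drawn from a geometric distribution
    n = random.choice(np.random.geometric(1/float(4),15))
    send.extend([senders[i]]*n)
    nat.extend([places[i]]*n)
    # pick a random recipient for each letter ('_' avoids reusing the outer loop variable i)
    for _ in range(n):
        r = random.choice(senders)
        recip.append(r)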
        
rows = zip(send,nat,recip)

with open('letters_exchange.csv','w', newline='') as f:  # newline='' avoids blank rows on Windows
    writer = csv.writer(f)
    writer.writerow(['senders','nationality','recipients'])
    for row in rows:
        writer.writerow(row)