**PySDS Week 2 Lecture 1.**

In [None]:
from IPython.display import display
%pylab inline

# Key data structures: 
- Numpy Array
- Series
- DataFrame

## Numpy Array 

The numpy array is the basis of the Series and DataFrame objects. It is very efficient. Unlike a list, the objects in an array are all of the same type, which allows for considerably faster computation. Here it is worth pointing out that much of Python is actually a wrapper for ```C``` code. C is a pervasive, extremely efficient language. That said, it is often cumbersome to use and does not provide anywhere near the level of abstraction of Python. Numpy keeps its values in typed blocks of memory and runs its loops in compiled C code, whereas a Python list holds references to separate Python objects. 
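To make the "same type" point concrete, here is a small sketch. The variable names are just for illustration. It shows that an array carries a single dtype, and that arithmetic on an array works element by element where the same operator on a list does something quite different.

```python
import numpy as np

values = [1, 2, 3]
arr = np.array(values)

# Every element of an array shares one dtype.
print(arr.dtype)

# Mixing integers and floats upcasts the whole array to a float type.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)

# Multiplying a list repeats it; multiplying an array works element by element.
doubled_list = values * 2
doubled_arr = arr * 2
print(doubled_list)
print(doubled_arr)
```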

We tend not to use the numpy array directly, although it can be useful for a number of tricks, as we will show later. One in particular is generating multiple columns of random numbers. However, for the most part we only interact with numpy through pandas and not directly. 

A numpy array is designed to implement matrix algebra, something useful in a variety of circumstances. For example, we can characterise a **social network** as a matrix and then use that matrix to learn things about the network. 

Before we get there, however, let's introduce the simple unidimensional array, sometimes called a **vector**. 

In [None]:
import numpy as np 

x = [1,2,3]

npx = np.array([1,2,3])

print(x,npx)

print(x[0],npx[0])


The numpy array can be unidimensional (i.e. just like a single list) or multidimensional. When it is unidimensional it is sometimes referred to as a vector. This is not quite appropriate according to the mathematicians, but it seems to be popular in computer languages. 

A two dimensional array is referred to as a matrix. So if we have a vector of friendship nominations that means we have a one dimensional array representing friendships from that person to the other people. 

If we have four friends, Alice, Bob, Charlie and Diane, they each have a vector referring to whether they are friends with each other. Let's keep each one of these in order of A,B,C,D. So for Alice, if she is only friends with Diane, her vector would look like: 
```
Alice = np.array([0,0,0,1])
```
Whereas Diane might consider herself friends with everyone. So hers looks like: 
```
Diane = np.array([1,1,1,0])
```
Notice that zero at the end? That's because Diane can't be friends with herself. When you stitch these one dimensional arrays together, you can get a matrix representing the network of friendships, like so: 

In [None]:
Alice = np.array([0,0,0,1])
Bob = np.array([1,0,0,1])
Charlie = np.array([0,1,0,1])
Diane = np.array([1,1,1,0])

friendshipMatrix = np.array([Alice,Bob,Charlie,Diane])

print(friendshipMatrix)

Notice a couple things about the output. First about the structure and second about the semantics. 

1. The structure: 
 - It's not very clear who is who in this matrix. We know that it goes Alice, Bob, Charlie, Diane so we can follow along. But that gets particularly difficult when we have many rows and columns to manage. Part of the reason for using pandas is that, where an array is just a bare grid of values, a pandas DataFrame allows us to have row and column labels, as well as indexing by those labels. We will show this in a minute. 

2. The semantics:
 - Notice that we said this was a network of friendships. Well, aren't friends supposed to be symmetric? Bob said Alice was his friend, but Alice did not say Bob was her friend. Drama! What if we had a way to determine whether a friendship is reciprocated? This is where ```numpy``` shines as a means of doing **linear algebra**. 
 - As we go through this example, it will be clear that not only is matrix algebra useful, but that it can be hard to follow without having labels on the rows and columns. So first let's do it, and then we will move over to the nicer data structures with labels.
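As a quick preview of what labels buy us, here is a minimal sketch that wraps the same friendship matrix in a DataFrame. The names `friend_df`, `bob_to_alice` and `alice_to_bob` are made up for this illustration; `.loc` looks a cell up by its row and column label.

```python
import numpy as np
import pandas as pd

names = ["Alice", "Bob", "Charlie", "Diane"]
friendship = np.array([
    [0, 0, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [1, 1, 1, 0],
])

# Rows are 'from', columns are 'to', and both now carry names.
friend_df = pd.DataFrame(friendship, index=names, columns=names)
print(friend_df)

# Look up a nomination by label instead of by position.
bob_to_alice = friend_df.loc["Bob", "Alice"]
alice_to_bob = friend_df.loc["Alice", "Bob"]
print(bob_to_alice, alice_to_bob)
```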


## How to determine if a friendship is reciprocated

1. To do this we would first flip the matrix around. Right now we have rows of 'from' and columns of 'to', so the first row holds Alice's friendship nominations to Bob, Charlie and Diane. By **transposing** we can turn this on its head so that 'to' is in the rows and 'from' is in the columns.
 - ```newMat = oldMat.transpose()```
 - $ \mathbf{A}^T$
2. Then we can multiply each cell by its corresponding cell in the transposed matrix. This is element-wise multiplication (written $\circ$ below), not the matrix product. If the friendship is unreciprocated, then the result will be $1 * 0$ which is $0$. If it is reciprocated, then it will be $1 * 1$ which is $1$. This will be a matrix of reciprocated friendships. 
 - ```recipMat = oldMat * newMat```
 - $ \mathbf{A}_r = \mathbf{A} \circ \mathbf{A}^T $
3. Finally, let's remove the reciprocated friendships from the original matrix. What we have left over are the unreciprocated friendships. 
 - ```unrecip = oldMat - recipMat```
 - $ \mathbf{A}_u = \mathbf{A} - \mathbf{A}_r $

See below: 

In [None]:
# Create a matrix from four vectors
Alice = np.array([0,0,0,1])
Bob = np.array([1,0,0,1])
Charlie = np.array([0,1,0,1])
Diane = np.array([1,1,1,0])

friendMat = np.array([Alice,Bob,Charlie,Diane])
print("The friendship matrix:")
print(friendMat,'\n')

# Get the transpose of that matrix
tMat = friendMat.T

print("The transposed matrix:")
print(tMat,'\n')

# Get the reciprocated friendships
recipMat = tMat * friendMat 

print("The reciprocated friendships")
print(recipMat,'\n')

# Get the unreciprocated friendships
unrecipMat = friendMat - recipMat

print("The unreciprocated friendships")
print(unrecipMat,'\n')
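The matrices above are hard to read without names, so here is one possible way to turn the result back into people. This is only a sketch: the list `names` and the variables `unreciprocated` and `reciprocated_pairs` are introduced here for illustration. `np.argwhere` returns the (row, column) position of every non-zero cell.

```python
import numpy as np

names = ["Alice", "Bob", "Charlie", "Diane"]
friend_mat = np.array([
    [0, 0, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [1, 1, 1, 0],
])

# Same two steps as in the cell above.
recip_mat = friend_mat * friend_mat.T
unrecip_mat = friend_mat - recip_mat

# Each non-zero cell is a nomination 'from' the row person 'to' the column person.
unreciprocated = [(names[i], names[j]) for i, j in np.argwhere(unrecip_mat)]
print(unreciprocated)

# Every mutual friendship shows up twice (once in each direction), so halve the total.
reciprocated_pairs = int(recip_mat.sum()) // 2
print(reciprocated_pairs)
```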

# The SERIES data structure

The Series data structure is very much akin to a vector. It is unidimensional and it classifies everything in the structure as a common type. If it is all integers, the Series will be of type integer. If it is a mix of integers and strings, it will be of type 'object', which is more generic. 

A series has an index which can be automatically created. The indices do not have to be unique, but if they are not, then the coder runs the risk of accidentally indexing the wrong element. We will show how to keep indices tidy later on. 
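A short sketch of both points, using made-up values: the type pandas picks for a series, and what happens when index labels repeat. With a duplicated label, asking for that label hands back every matching row rather than a single value.

```python
from pandas import Series

# All integers: the series gets an integer dtype.
ints = Series([1, 2, 3])
print(ints.dtype)

# Integers mixed with a string: the series falls back to the generic 'object' dtype.
mixed = Series([1, "two", 3])
print(mixed.dtype)

# A non-unique index is allowed, but a label lookup now returns several rows.
dup = Series([10, 20, 30], index=["a", "b", "a"])
print(dup.index.is_unique)
print(dup["a"])
```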

Let's import the series below: 

In [None]:
from pandas import Series

# Creates a single element series (not four empty rows)
ser1 = Series(4)

print(ser1)

# Creates a series with four of the same elements:

ser2 = Series([1]*4)

print(ser2)

# Creates a series with a range of numbers: 
# Remember with range when you have three arguments it is:
# range(<start>,<exclusive stop>,<step>)

ser3 = Series(range(1,8,2))
print(ser3)

# Create a series with a string. Notice that since it is non-numeric, it's just classed as 'object'

ser4 = Series(["Alice","Bob","Charlie","Diane"])
print(ser4)

## Operations on a series. 

We can operate on every element in a series directly. With a list, if we type ```list1 * 2``` the result will be the list repeated twice. But if we do it for a series, we will multiply _every element_ by 2. See below: 

In [None]:
from pandas import Series

# This here is a 'function' - we call this function with an
# argument. This means we send it input (listToDouble)
# and it prints the results.

def doubleUpDemo(listToDouble):
    print("Here's a list * 2")
    print(listToDouble*2)
    print() 
    
    ser1 = Series(listToDouble)
    print("Here's a series * 2")
    print(ser1 * 2)
    print()
    return 

# First we send a list of integers to the function
doubleUpDemo([1,2,3,4])

# Next we send a list of strings
doubleUpDemo(["a","b","c","d"])

Notice that when we used strings, each cell of the series was doubled, so ```"a"``` became ```"aa"```. This is because the ```*``` operator is **overloaded**, which means that it refers to multiple potential operations depending on the context. The ```+``` symbol is also overloaded, as we already know: it can mean both plus and concatenate. If we try this with an operator that strings do not support, such as the exponent ```**```, we get an error. See below: 

In [None]:
from pandas import Series

print("Here's a series of numbers to the second power")

ser1 = Series([1,2,3,4])
print(ser1 ** 2)
print()

print("Here's a series of strings to the second power")

ser2 = Series(["a","b","c","d"])
print(ser2 ** 2)
print()

## Series and indices 

Every series has an index for each of the elements in the series. The index itself is available through ```<seriesName>.index```. The index can be replaced, so you can either set the names for your index when you create your series or assign a new index later on. You can also reindex a series, which is important if you're concatenating two series. 

A series is **ordered** so we can index every element by its position in addition to indexing it by the index name. 

In [None]:
from pandas import Series 

ser1 = Series(["a","b","c","d"], index = ["alpha","bravo","charlie","delta"])
print(ser1,"\n")
print("Here is the first element:",ser1.iloc[0],"\n")
print(ser1,"\n")
print("Here is the element from index 'alpha':",ser1["alpha"])

Just because it is ordered and you can, in theory, index it by position, _you really shouldn't_ rely on plain square brackets to do so. Just watch what happens when we give the index numerical values in the wrong order. When we try to index element 0 we get 'c' and not 'a' as we got above, because pandas now treats the 0 as an index label rather than a position. Instead, index by name when you need to access particular values in a series, or simply work through the series in order.

That being said, positional numbers are still really useful for slicing and will always work as expected. 

In [None]:
ser1.index = [1,4,0,2]
print(ser1)

print("By position?")
print(ser1[0],"\n")

print("Slicing up to the third element.")
print(ser1[:2],"\n")

print("Slicing from third element onwards.")
print(ser1[2:],"\n")
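If you want to be explicit about which kind of lookup you mean, pandas offers two accessors: `.loc` for labels and `.iloc` for positions. The sketch below reuses the shuffled index from the cell above; the variable names are only for illustration.

```python
from pandas import Series

ser = Series(["a", "b", "c", "d"], index=[1, 4, 0, 2])

# .loc always means 'the row whose label is 0'.
by_label = ser.loc[0]

# .iloc always means 'the row in position 0'.
by_position = ser.iloc[0]

# Positional slicing with .iloc takes the first two rows, whatever their labels are.
first_two = ser.iloc[:2]

print(by_label, by_position)
print(first_two)
```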

## Ways to create a series

We already saw how to create a series from a list as well as an index from a list. If you have a dictionary, you can also turn it into a series. It will keep the key as the index and the value as the value in the cell. See below: 

In [None]:
from pandas import Series

# You can also create a series with an index in one go using a dictionary. 

dict1 = {"alpha":"a","bravo":"b","delta":"d","epsilon":"e"}
ser1 = Series(dict1)

print(ser1)
print()

If your series has a **misalignment** between the length of the collection of values and the length of the index, pandas will not try to guess what you meant. When the values are given as a list, it raises a ```ValueError``` if the index and the values are not of the same length, as the example below does. 

In [None]:
from pandas import Series

values1 = [1,3,5]
index1 = ["Apples","Oranges","Bananas","kiwis","durian"]

ser1 = Series(values1,index=index1) 
print(ser1)
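The cell above stops with an error. For comparison, here is a hedged sketch of two ways to handle it: catching the error, and passing the values as a dictionary so that pandas can line them up by label and leave the unmatched labels as missing. The variable names `raised` and `by_label` are made up for this example.

```python
from pandas import Series

index1 = ["Apples", "Oranges", "Bananas", "kiwis", "durian"]

# A list of three values cannot be matched to five labels.
try:
    Series([1, 3, 5], index=index1)
    raised = False
except ValueError as err:
    raised = True
    print("pandas refused:", err)

# A dictionary carries its own labels, so pandas aligns on them
# and fills the labels it cannot find with NaN.
by_label = Series({"Apples": 1, "Oranges": 3, "Bananas": 5}, index=index1)
print(by_label)
```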

## Filtering a series

There are many ways to filter a series. Two featured here involve **slicing** and **Boolean logic**.

### Slicing 

Just like how a list can be sliced, we can similarly slice a Series.

In [None]:
from pandas import Series

ser1 = Series(["a","b","c","d"])

ser1.index = [1,4,0,2]
print(ser1,"\n")

print("By position?")
print(ser1[0],"\n")

print("Slicing up to the third element.")
print(ser1[:2],"\n")

print("Slicing from third element onwards.")
print(ser1[2:],"\n")

### Boolean Logic
If you recall, Boolean logic allows you to evaluate the logical truth condition of a statement. So if ```x = 4``` and ```y = 4``` then ```x == y``` will be true. With a series, instead of returning whether _the series_ as a whole is true or false, pandas evaluates each cell and returns a new series of True and False values showing which cells satisfy the condition. So if we have a series:
~~~py 
ser1 = Series([1,2,3,4,5])
~~~
Then asking:
~~~py
ser1 > 3
~~~
will return a series of True / False values that is True only where the value is greater than 3. Passing that Boolean series back into the original, as in ```ser1[ser1 > 3]```, keeps only those values. See the example below: 

In [None]:
from pandas import Series

ser1 = Series(["a","b","c","d"])
ser2 = Series([1,3,5,7,9,11])

# We can filter a series in lots of different ways. 

# Every time you evaluate a series by boolean logic it returns a series of that length true / false
print(ser1 > "c")
print()

print(ser2 > 5)
print() 

# You can then apply this to your original series to filter out the false entries. 
ser2q = ser2 > 5
print(ser2)
print()
print(ser2q)
print("\nThe new slimmer series\n")
print(ser2[ser2q])
print()
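Boolean series can also be combined. As a small sketch with illustrative variable names: `&` means "and", `|` means "or" and `~` means "not". Each comparison needs its own parentheses, because these operators bind more tightly than `>` or `<`.

```python
from pandas import Series

ser = Series([1, 3, 5, 7, 9, 11])

# Both conditions must hold.
between = ser[(ser > 3) & (ser < 11)]

# Either condition may hold.
extremes = ser[(ser < 3) | (ser > 9)]

# Flip a condition with ~.
not_five = ser[~(ser == 5)]

print(between.tolist())
print(extremes.tolist())
print(not_five.tolist())
```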

Slicing and filtering are especially useful if you have some missing data and you want to delete the cases "listwise", meaning exclude the whole row. Instead of writing a comparison, you build the Boolean series with the function:

~~~python 
series.notnull() 
~~~

which will return true for all the non-null values. You can also use the opposite function: 

~~~python 
series.isnull() 
~~~


In [None]:
ser3 = Series([1,4,7,None,8,9])
print(ser3)

print(ser3.isnull())
print()

print(ser3[ser3.notnull()])
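Two related methods are worth knowing about, shown here as a brief sketch: `dropna()` is a shortcut for the filter above, and `fillna()` replaces the missing values instead of dropping them. The fill value of 0 is an arbitrary choice for the demonstration.

```python
from pandas import Series

ser3 = Series([1, 4, 7, None, 8, 9])

# Filtering with the notnull() mask...
filtered = ser3[ser3.notnull()]

# ...gives the same result as dropna().
dropped = ser3.dropna()
print(filtered.equals(dropped))

# fillna() keeps the row and substitutes a value of your choosing.
filled = ser3.fillna(0)
print(filled)
```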

## Key Series Operations 

There are a number of operations you can do on a series. You can see the lot of them by typing: 
~~~python
dir(Series)
~~~

We are here focusing on a handful of these for data processing: 
- value_counts()
- unique() 
- sort_values(), sort_index() and reindex() 

### Value Counts

```value_counts()``` returns a new series where the earlier values are now the index and the values are the counts of each index entry. So if you have a Series with the following numbers:

~~~python
ser1 = Series([1,1,7,7,7,33,1,6,33])
~~~

Then you have 3 of the number 1, 3 of the number 7, 2 of the number 33 and one 6. 

To see this summarised, type 

~~~python
print(ser1.value_counts()) 
~~~

As ```value_counts()``` returns the new series, we can print it directly. See below: 

In [None]:
from pandas import Series 

ser1 = Series([1,1,7,7,7,33,1,6,33])

print(ser1,"\n")
print(ser1.value_counts(),"\n")

# Since a string is a sequence of characters, this is 
# also a quick way to get a count of characters in a string. 

ser2 = Series(list("the quick brown fox jumps over the lazy dog"))
print(ser2,"\n")
print(ser2.value_counts(),"\n")

In [None]:
# Check unique: there are 27 distinct values, not 26, because the space character is counted too.
print(len(ser2.unique()))

So in the previous example, the quick brown fox... was turned into a series that descended in order from the most frequent to the least frequent. 
1. What if we want to have it sorted alphanumerically? 
2. What if we only want counts of valid alphanumeric characters and not spaces? 

For the first one we can use ```sort_index()``` to re-sort the counts by index. For the second, we can filter the series down to alphabetic characters before counting. 

In [None]:
from pandas import Series

ser1 = Series(list("the quick brown fox jumps over the lazy dog"))

# Keep only the alphabetic characters (this drops the spaces).
ser1 = ser1[ ser1.map(lambda x: x.isalpha()) ]

# Count each letter, then sort the counts alphabetically by index.
print(ser1.value_counts().sort_index())


In [None]:
# We can use value counts to plot a summary of data as well. 
# ser1 is the series of letters from the previous cell.
ser1.value_counts().sort_index(ascending=False).plot(kind="barh")
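The list of key operations also mentions `unique()` and `reindex()`. Here is a small sketch of both on a shorter string chosen for this example. `reindex()` rearranges a series to follow a new list of labels, and the `fill_value` argument says what to put where a label had no entry, which is handy for showing letters that never occurred.

```python
import string
from pandas import Series

letters = Series(list("hello world"))
letters = letters[letters.map(lambda ch: ch.isalpha())]

# unique() returns each distinct value once.
distinct = letters.unique()
print(len(distinct))

# value_counts() only lists the letters that occur...
counts = letters.value_counts()

# ...so reindex against the full alphabet and fill the gaps with 0.
full_counts = counts.reindex(list(string.ascii_lowercase), fill_value=0)
print(full_counts)
```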

# The DataFrame data structure

DataFrames can be thought of as aggregates of series. They are tabular data structures. 
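One way to see the "aggregate of series" idea is to build a DataFrame from a dictionary of series and then pull a column back out. This is a minimal sketch with made-up column names; each series becomes a column, and the shared index labels the rows.

```python
from pandas import DataFrame, Series

labels = ["first", "second", "third"]
numbers = Series([1, 2, 3], index=labels)
squares = Series([1, 4, 9], index=labels)

# Each dictionary key becomes a column name.
df = DataFrame({"number": numbers, "square": squares})
print(df)

# Selecting one column hands back a Series again.
column = df["square"]
print(type(column))
```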

In [None]:
# Import the DataFrame class and numpy's missing-value marker.
from pandas import DataFrame
from numpy import nan

# One dimensional data frame with no indices or column labels 
df1 = DataFrame([1,2,3,4,5])
print(df1,"\n")

# Two dimensional data frame with no indices or column labels. 
# Note that each 'inner list' is treated as a row, which is why they come out horizontal.
df2 = DataFrame([[1,2,3,4,5],[2,5,10,17,26]])
print(df2,"\n")

# Here we can see a data frame of rows
# Notice how pandas handles the missing value
df3 = DataFrame([[1,2],[2,5],[3,10],[nan,4],[5,26]])
print(df3,"\n")

# Let's replace the index for this data frame
df3.index = ["first","second","third","fourth","fifth"]
print(df3,"\n")

# Let's replace the column labels.
df3.columns = ["number","sq_plus_1"]
print(df3,"\n")



In [None]:
df3.sort_values("number",ascending=True,inplace=True)

print(df3)

sortedSQplus1 = df3.sort_values("sq_plus_1",ascending=False)

print(sortedSQplus1)

In [None]:
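# Getting data into a DataFrame 
import pandas as pd

# DataFrame.from_csv is no longer part of pandas, so we use read_csv.
# index_col=0 uses the first column as the index.
df_pol = pd.read_csv("WD18_PolCandidates.csv",index_col=0)
display(df_pol.head(3))

# Basic - look what happened! 
# Name is in its own row, the names are now indices.
# We want to tell the parser that we want to keep names not as an index.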

In [None]:
import os
print(os.getcwd())

In [None]:
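# This should work just fine.
import pandas as pd

# DataFrame.from_csv is no longer part of pandas, so we use read_csv.
# By default read_csv keeps every column as data and numbers the rows itself.
df_pol = pd.read_csv("WD18_PolCandidates.csv")
# df.head() just shows the first n rows (5 by default)
df_pol.head()
# Okay, that's much better. 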

In [None]:
df_pol[(df_pol["party"] == "Labour Party") | (df_pol["constituency"] == "Aberavon")]

In [None]:
# How many people tweet by party? 
# What is the ratio of people who tweet by party? 
# Report on only those parties with more than 10 people running for office.  

# Series 1. How many people per party. 
partyCount = df_pol["party"].value_counts()
print(partyCount.head(10))

# Series 2. How many people per party have a twitter account.
# notnull() gives a True / False series, which we use to filter the rows.
haveTwitter = df_pol["twitter_username"].notnull()
partyCountWithTwitter = df_pol[haveTwitter]["party"].value_counts()

display(partyCountWithTwitter)



In [None]:
# Creating a new DataFrame with these two series together.
# Each series becomes a row, so the parties end up as the columns.
# Notice that .T means transpose. This is the simplest way to swap rows and columns.
from pandas import DataFrame

df_parties = DataFrame([partyCount,partyCountWithTwitter],index=["Party Count","Have Twitter"])
display(df_parties)
display(df_parties.T)

# We want to transpose! 

Transposition in linear algebra is taking the rows and making them the columns (and vice versa)

so: 
~~~
a b c d
e f g h
~~~
becomes:
~~~
a e
b f
c g
d h
~~~

To do this in a DataFrame we just add ```.T``` at the end. 

```print(df.T)``` will print a transposed copy of the DataFrame ```df``` and leave ```df``` itself unchanged.

```df = df.T``` will replace ```df``` with its transposed version. 
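The letter grid above can be tried out directly. This is a small sketch with illustrative names (`wide` and `tall`); note that taking `.T` does not alter the original frame.

```python
from pandas import DataFrame

wide = DataFrame([["a", "b", "c", "d"],
                  ["e", "f", "g", "h"]])

# .T returns a transposed copy: two rows of four become four rows of two.
tall = wide.T

print(wide)
print(tall)
print(wide.shape, tall.shape)
```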

In [None]:
df_parties = df_parties.T
print(df_parties.head())

In [None]:
# Creating a new variable that is the result of two other variables. 
df_parties["proportion"] =  df_parties["Have Twitter"] / df_parties["Party Count"]

df_parties.sort_values("proportion",inplace=True,ascending=True)

# We could just print it, but it looks nicer to use the HTML. 
# Compare: 

display(df_parties[df_parties["Party Count"] >= 10])

# pd.options.display.float_format = '{}'.format
# You can use this code to change the display format per table. 
# Unfortunately, it doesn't work per column. You will have to seek elsewhere for that.
# pd.options.display.float_format = '{:.3f}'.format

# display(df_parties[df_parties["Party Count"] >= 10])

# for i in df_parties.index:
#     if i in ["National Front", "British National Party"]:
#         display(df_parties.loc[i])

In [None]:
# What about counts of a categorical, say, the percent of candidates who were women? 
# There are lots of ways to do this. I'm going to use a 'map' to map the gender on to a binary

# What's the gender column called? 
print(df_pol.columns)

In [None]:
# Okay, it's "gender". 
df_pol["gender"].value_counts(dropna=False)

In [None]:
# Uh oh, it seems that the gender was entered in a number of ways.
# What's up with Female there twice? 
# Let's check
print(df_pol["gender"].value_counts().index)

In [None]:
# A-ha! one has "Female " and one has "Female"
# Let's turn this into a binary variable 
mapper = {
    "male":0,
    "Male":0,
    "Man (sex)":0,
    "female":1,
    "Female":1,
    "Female ":1
}

# map() looks each value up in the dictionary; anything not listed becomes NaN.
df_pol["bgender"] = df_pol["gender"].map(mapper)
df_pol["bgender"].value_counts(dropna=False)
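Listing every spelling in the mapper works, but it grows quickly. One alternative, sketched here on a made-up series that reuses the spellings from the mapper above, is to tidy the text first with the `.str` methods and then map a much shorter dictionary. The names `raw`, `cleaned`, `short_mapper` and `binary` are only for this example.

```python
from pandas import Series

raw = Series(["male", "Male", "Man (sex)", "female", "Female", "Female ", None])

# Strip stray spaces and lower-case everything before mapping.
cleaned = raw.str.strip().str.lower()

# Far fewer spellings are left to handle.
short_mapper = {"male": 0, "man (sex)": 0, "female": 1}
binary = cleaned.map(short_mapper)

print(binary.value_counts(dropna=False))
```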

In [None]:
# Okay, so we can see that 1035 were women and 2932 were men, 4 were undefined in the data.
# Are some parties more gender balanced than others?

# First let's 'group' the data
polGroup = df_pol.groupby("party")

# groupby creates a representation of the same data, not new data. 
# We can use this representation to create aggregates. 
# The .head(10) just means give me the first ten. 
print(polGroup["bgender"].mean().head(10))

# Ok, so this looks coherent. Why don't we append this to our parties dataframe?
# That way we can sort in lots of ways, create new variables for the data frame and more. 
df_parties["gender_ratio"] = polGroup["bgender"].mean()

# This will display the whole table sorted by Gender Ratio:
display(df_parties[df_parties["Party Count"] > 10].sort_values("gender_ratio",ascending=False))

# Notice in this one, we are only going to sort and display the gender_ratio series. 
# Do you understand the difference between this code and the code above? 
display(df_parties[df_parties["Party Count"] > 10]["gender_ratio"].sort_values(ascending=False))
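Why does the mean give us the share of women? Because the variable is coded 0 and 1, the mean of a group is simply the proportion of 1s in it. A toy sketch with invented party names makes this easy to check by hand:

```python
from pandas import DataFrame

# Invented data: three candidates for 'Red' (one woman), two for 'Blue' (both women).
toy = DataFrame({
    "party":   ["Red", "Red", "Red", "Blue", "Blue"],
    "bgender": [1, 0, 0, 1, 1],
})

# The mean of a 0/1 column within each group is the proportion of 1s.
share = toy.groupby("party")["bgender"].mean()
print(share)
```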

In [None]:
plot_title = "Proportion of Female Candidates by Party (UK 2015 General Election)"
df_parties[df_parties["Party Count"] > 20]["gender_ratio"].sort_values(ascending=True).plot(kind="barh",title=plot_title)

In [None]:
#### Optional note for LaTeX users! ####
#
# We can format these things for LaTeX as well. 
# Remember the last query we did - just add .to_latex() at the end
print(df_parties[df_parties["Party Count"] > 10].sort_values("gender_ratio",ascending=False).to_latex())



# Data Types for input / output

As a prelude to next week, here is a brief introduction to a number of different file formats. Next week we will look at how to get them into and out of Python. 

- json
- xml
- sql
- csv
- Microsoft Excel (xls and xlsx)
- serialization

## JSON (JavaScript Object Notation)

JavaScript Object Notation is a very lightweight text format for exchanging and storing data from the web. Many APIs use JSON for their data interchange. Considered in terms of Python, it is basically just nested lists and dictionaries holding Unicode strings, numbers, Booleans and nulls. 
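As a taste of what that looks like, here is a minimal sketch using Python's built-in `json` module on an invented record. `json.loads` turns JSON text into Python objects and `json.dumps` goes the other way.

```python
import json

# Invented example record, written as JSON text.
text = '{"name": "Alice", "friends": ["Diane"], "age": null}'

# JSON objects become dictionaries, arrays become lists, null becomes None.
record = json.loads(text)
print(record["friends"])
print(record["age"])

# And back to text again.
round_trip = json.dumps(record)
print(round_trip)
```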

## XML (eXtensible Markup Language) 

XML is a language for marking up data. Like other 'ML's such as HTML, it wraps content in tags. An XML document usually opens with a short declaration that describes the document (for example its version and encoding), and then the body uses nested tags to signify the data and some properties about it. XML is very verbose as every piece of data is tagged in some way. 
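For a feel of the tagging, here is a small sketch that parses an invented XML snippet with the standard library's `xml.etree.ElementTree` module. The tag and attribute names are made up for this example.

```python
import xml.etree.ElementTree as ET

# Invented snippet: one person with a name attribute and one nested tag.
text = "<people><person name='Alice'><friend>Diane</friend></person></people>"

# fromstring() parses the text and returns the outermost (root) element.
root = ET.fromstring(text)
person = root.find("person")

print(root.tag)
print(person.get("name"))
print(person.find("friend").text)
```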

## SQL (Structured Query Language)

SQL is the standard language for querying relational databases (where data is stored in linked tables). Later, we will look at a little bit of data in an SQLite database, a lightweight relational database stored in a single file. Relational databases are useful for managing large amounts of data, although more of this will be covered in the Big Data Analytics class. 
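Python ships with a `sqlite3` module, so a query can be sketched without installing anything. The table and its rows below are invented, and the database lives only in memory for the length of the example.

```python
import sqlite3

# ":memory:" creates a throwaway database that is never written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE candidates (name TEXT, party TEXT)")
conn.executemany(
    "INSERT INTO candidates VALUES (?, ?)",
    [("Alice", "Red"), ("Bob", "Blue"), ("Diane", "Red")],
)

# Count candidates per party, much like value_counts() did earlier.
rows = conn.execute(
    "SELECT party, COUNT(*) FROM candidates GROUP BY party ORDER BY party"
).fetchall()
conn.close()

print(rows)
```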

## CSV (Comma Separated Values)

Comma-separated values are a traditional way to encode tabular data. The format works very simply by using newline characters to denote rows and commas to denote columns. It is possible to read a CSV into Python the hard way, by splitting the lines yourself, but the _csv_ package handles details such as quoted fields for you and makes it possible to import a lot of data programmatically.
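Here is a minimal sketch of the _csv_ package at work. To keep it self-contained, the "file" is an invented string wrapped in `io.StringIO`, which behaves like an open text file. `csv.DictReader` uses the first row as column names.

```python
import csv
import io

# Invented data standing in for the contents of a CSV file.
text = "name,party\nAlice,Red\nBob,Blue\n"

# Each remaining row becomes a dictionary keyed by the header row.
reader = csv.DictReader(io.StringIO(text))
rows = list(reader)

for row in rows:
    print(row["name"], row["party"])
```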

## Serialization - Pickling your data

In most languages there's a way to take a data structure as is and simply write it to a file, so that when the file is read back the data loads just as it started. In Python this is called pickling. Next week we will be pickling and unpickling files. 
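As a preview, here is a small sketch of a round trip with the built-in `pickle` module. An in-memory `io.BytesIO` buffer stands in for a real file so that the example leaves nothing on disk; the dictionary is invented.

```python
import io
import pickle

data = {"names": ["Alice", "Bob"], "counts": [3, 1]}

# Write the object to a file-like buffer...
buffer = io.BytesIO()
pickle.dump(data, buffer)

# ...rewind to the start and load it straight back.
buffer.seek(0)
restored = pickle.load(buffer)

print(restored == data)
```

Only unpickle files you trust, since loading a pickle can run arbitrary code.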
