In [2]:
%matplotlib inline 

# Intro to Python for Social Scientists

This tutorial provides an introduction to programming in Python, along with a few introductory examples on how Python is generally used in social science research. We will cover: 


- Data types: integers, floats, strings, booleans
- Data structures: lists, sets, dictionaries and tuples
- Loops
- Conditional statements
- Writing functions
- Reading and writing data
- Importing third party modules
- Working with data in different formats
- Basic visualization
- Additional resources


## Variables, data types and operators
You create a new variable by simply assigning a value to a name. There is no separate declaration step, and Python infers the type from the value.

In [3]:
a="Hello World!" #a string variable. Strings need to be placed in single or double quotation marks. 
b=2 #an integer variable
c=2/3 #a float variable
d=(b==24) #a boolean variable

To print to console:

In [4]:
print(a)

Hello World!


In [5]:
b

2

In [6]:
print(c+b)

2.6666666666666665


In [7]:
d

False

You should always know what type your variables are, since some operations can only be done on certain types of variables. To check variable types: 

In [8]:
print("a is", type(a), ",b is", type(b),  ",c is", type(c),  ",d is", type(d) )

a is <class 'str'> ,b is <class 'int'> ,c is <class 'float'> ,d is <class 'bool'>


### Operators
Mathematical, comparison and boolean operators, in their order of evaluation (highest precedence first):  
1. exponent: \**
2. multiplication, division, modulo: \*, /, \%
3. addition, subtraction: +, -
4. comparison operators: <, <=, >=, >, ==, !=
5. identity and membership operators: is, is not, in, not in (these have the same precedence as the comparison operators in 4)
6. boolean NOT, AND, OR, evaluated in that order: not, and, or

Use () to change the default order. This is just maths. 

In [9]:
2**b+20/b<=15


True

In [10]:
2**(b+20)/b<=15

False

In [11]:
d==False

True

In [12]:
d is not True

True

In [13]:
d is not True and b==3

False

In [14]:
d is not True or b==3

True

In [15]:
result=(b+c)-(d*2)
result

2.6666666666666665
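The difference between `==` and `is` in the operator list is easy to miss: `==` compares values, while `is` checks whether two names refer to the very same object. The short sketch below illustrates this with made-up lists (lists are introduced properly in the Data Structures section); the variable names are only for illustration.

```python
# Two separate lists that happen to hold the same values (illustrative data)
list_one = [1, 2, 3]
list_two = [1, 2, 3]
alias = list_one  # a second name for the same list object

same_value = list_one == list_two    # True: the contents are equal
same_object = list_one is list_two   # False: they are two distinct objects
alias_is_same = alias is list_one    # True: both names point to one object
is_member = 2 in list_one            # True: membership test

print(same_value, same_object, alias_is_same, is_member)
```

For everyday comparisons of numbers and strings, `==` is usually what you want; `is` is mostly used in checks such as `x is None`.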

You can use some of these operators on strings as well. 

In [16]:
"Hello" in a

True

But try: 

In [17]:
x="2"
y="3"
result=x+y
print(result)

23


For strings, '+' performs concatenation. 

In [18]:
type(result)

str

Now try this:


In [19]:
x="2"
y=3
result=x+y
print(result)

TypeError: must be str, not int

What does the error say? 

### Converting variables from one type to another
Sometimes your data contains numbers that are stored as strings, just like x and y above. Or some values of a variable are coded as strings while others are numbers, although you believe they should all be numbers. If you try to add them, you get an error saying that you can't add strings and numbers. Assuming all the values look like numbers, you can convert them into integers or floats, or convert numbers back into strings. Now try this:

In [20]:
x="2"
y="3"
result=int(x)+float(y)
print(result)

5.0


In [21]:
type(result)

float

And back to string:

In [22]:
type(str(result))

str
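Conversion only works when the string really looks like a number. As a minimal sketch, assuming a hypothetical value "n/a" that marks a missing answer, the example below shows what happens with decimals and how a failed conversion can be caught with `try`/`except` instead of stopping the program.

```python
# A string holding a decimal number
x = "2.5"
as_float = float(x)       # 2.5
as_int = int(as_float)    # 2, because int() drops the decimal part
# Note: int("2.5") would raise a ValueError, so we go through float() first.

# A string that does not look like a number (hypothetical missing-value code)
raw_value = "n/a"
try:
    number = float(raw_value)
except ValueError:
    number = None         # keep a placeholder when conversion fails

print(as_float, as_int, number)
```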

### Exercise: 
1. Create a new variable called 'birth_year' that contains your year of birth.
2. Using your birth year, calculate your age and assign it to a new variable called 'age'.  
3. Print a sentence of the form "I am *age* years old." to the console.
4. Create a new string variable called 'sentence' that contains this statement. 

Write your code in the box below. Let's see who finishes first! 

## Data Structures 
Note that the variables we've been working with so far each contain a single value. However, what we normally refer to as "variables" in data analysis are variables from datasets, which contain more than one value. In Python, collections of values like these can be stored in lists, sets, dictionaries and tuples. 

### Lists
Lists are stored between square brackets, and the elements are separated by commas. Here is a list of ages:

In [23]:
ages=[21, 20, 19, 21, 20, 33, 22, 23, 26, 21, 22, 30, 19, 28]
ages

[21, 20, 19, 21, 20, 33, 22, 23, 26, 21, 22, 30, 19, 28]

In [24]:
len(ages) # this is the number of elements in the list

14

Lists can be indexed and sliced: 

In [25]:
# Indexing - getting an element by position. Note that we start from 0 and we stop at len(list)-1. 
first_element=ages[0] # this is the element at index 0
last_element=ages[13] # this is the element at index 13
print(first_element, "to", last_element)


21 to 28


In [26]:
# Slicing - getting a subset of the elements in the list.   
first_3=ages[0:3] # the same thing as ages[:3] 
last_3=ages[-3:] # the same thing as ages[11:14]
print(first_3, "and", last_3)

[21, 20, 19] and [30, 19, 28]


In [27]:
ages[10:14] 

[22, 30, 19, 28]

### Other common list operations

In [28]:
# Check if values in list:
40 not in ages # true if value is not in the list

True

In [29]:
# sorting the list by values:
ages.sort()
ages

[19, 19, 20, 20, 21, 21, 21, 22, 22, 23, 26, 28, 30, 33]

In [30]:
# adding to the list
ages.append(2)
ages

[19, 19, 20, 20, 21, 21, 21, 22, 22, 23, 26, 28, 30, 33, 2]

In [31]:
# concatenating two lists:
l1=["a", "b", "c"]
l2=[1, 2, 3]
l3=l1+l2
l3

['a', 'b', 'c', 1, 2, 3]

In [32]:
# removing an element from the list by value
ages.remove(2)
ages

[19, 19, 20, 20, 21, 21, 21, 22, 22, 23, 26, 28, 30, 33]

In [33]:
# finding the index (position) of the first place where the value occurs in the list
ages.index(21)

4

In [34]:
# remove elements from the list by index (here, a slice covering the first five elements)
del ages[0:5]
ages

[21, 21, 22, 22, 23, 26, 28, 30, 33]

### Sets
A set contains an unordered collection of unique and immutable objects. If you want to get all unique values in a list, a quick way is to transform the list into a set:

In [35]:
set_ages=set(ages)
set_ages

{21, 22, 23, 26, 28, 30, 33}

In [36]:
unique_ages=list(set_ages)
unique_ages

[33, 21, 22, 23, 26, 28, 30]
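Sets also support the usual set operations, which are handy for comparing two groups of values. The sketch below uses two made-up sets of ages (not taken from the data above); because sets are unordered, `sorted()` is used to get a predictable list back.

```python
# Two illustrative sets of ages
ages_group_a = {21, 22, 23, 26}
ages_group_b = {22, 26, 30, 33}

in_both = ages_group_a & ages_group_b      # intersection: values in both sets
in_either = ages_group_a | ages_group_b    # union: values in at least one set
only_in_a = ages_group_a - ages_group_b    # difference: values only in group a

# sorted() returns an ordered list, regardless of how the set is stored
print(sorted(in_both))     # [22, 26]
print(sorted(in_either))   # [21, 22, 23, 26, 30, 33]
print(sorted(only_in_a))   # [21, 23]
```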

### Dictionaries
In a dictionary, an entry consists of a word and the word's definition. The word is the key to finding out what a word means, and what the word means is considered the value for that key. In Python, dictionaries have keys and values. Keys are used to find values. Here is a dictionary of people and their ages: 

In [37]:
mydict = {"John": 21,
          "Jake": 20,
          "Jack": 23,
         }
mydict

{'Jack': 23, 'Jake': 20, 'John': 21}

In [38]:
mydict.keys()

dict_keys(['John', 'Jake', 'Jack'])

In [39]:
mydict.values()

dict_values([21, 20, 23])

In [40]:
mydict["John"]

21

Dictionaries will be very useful when we start working with web data, such as social media data. 
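Tuples, the fourth structure listed at the start of this section, behave like lists that cannot be changed after they are created. A minimal sketch, reusing the names and ages from the dictionary above: it shows unpacking a tuple, the error raised when you try to modify one, and how a dictionary's `items()` method hands back (key, value) tuples.

```python
# A tuple is written with parentheses: here (name, age)
person = ("John", 21)
name, age = person          # unpacking: one variable per element

# Tuples are immutable, so item assignment raises a TypeError
try:
    person[1] = 22
    changed = True
except TypeError:
    changed = False

# items() yields each dictionary entry as a (key, value) tuple
mydict = {"John": 21, "Jake": 20, "Jack": 23}
pairs = list(mydict.items())

print(name, age, changed)
print(pairs)
```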

## Indentation
Python **requires** blocks to be structured through indentation. Not just as a matter of style, but as a rule. Statements with the same distance to the left belong to the same block of code. To nest blocks, you need to indent them further to the right. The number of white spaces doesn't matter; what matters is that you consistently use the same number for statements that belong to the same block. Usually, we start at the very left edge, and each level in goes a further 4 white spaces to the right (most editors insert these when you press the tab key). If the code does not follow this rule about the relative indentation of blocks, then you will get an **IndentationError**.

However, the indentation level is ignored when you use explicit (or implicit) continuation lines. You can split a list or dictionary across multiple lines, and the indentation doesn't matter.

You will see a few examples in the sections below. 
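As a first illustration before those sections, the sketch below combines both rules: a list split across continuation lines, where indentation is free, and nested blocks, where it is not. The ages and labels are made up for the example.

```python
# Inside brackets, continuation lines can be indented any way you like
ages = [
    21, 20,
          19,
]

labels = []
for age in ages:
    # everything indented by 4 spaces belongs to the for loop
    if age >= 21:
        # indented by 8 spaces: belongs to the if block
        labels.append("21 or older")
    else:
        labels.append("under 21")

# back at the left edge: runs once, after the loop has finished
print(labels)  # ['21 or older', 'under 21', 'under 21']
```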

## Loops
Most of our work involves some type of iteration over observations in a dataset. Iteration is very easy and intuitive in Python, and there are  many ways to loop through data in order to access and manipulate it. 

In [41]:
# for loops
for i in range(5):
    print("I can count to "+str(i))

I can count to 0
I can count to 1
I can count to 2
I can count to 3
I can count to 4


In [42]:
# while loops
counter = 0
while counter < 5:
    print("I can count to", counter)
    counter += 1

I can count to 0
I can count to 1
I can count to 2
I can count to 3
I can count to 4


In [43]:
#List comprehension
k=[key for key in mydict.keys()]
k

['John', 'Jake', 'Jack']
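A list comprehension can also filter and transform values in one step. The sketch below rebuilds the same dictionary of names and ages so that it runs on its own; the threshold of 20 is an arbitrary choice for illustration.

```python
mydict = {"John": 21, "Jake": 20, "Jack": 23}

# Keep only the names of people older than 20
over_20 = [name for name, age in mydict.items() if age > 20]

# The same idea works for dictionaries: everyone's age next year
ages_next_year = {name: age + 1 for name, age in mydict.items()}

print(over_20)          # ['John', 'Jack']
print(ages_next_year)   # {'John': 22, 'Jake': 21, 'Jack': 24}
```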

## Conditional statements
Data management, processing and analysis involve making a series of decisions. We use conditional statements (most often in the form of if statements) to make these decisions.   

In [44]:
# if statement
for i in range(5):
    if i  % 2 == 0:
        print("I can count even numbers to "+str(i))

I can count even numbers to 0
I can count even numbers to 2
I can count even numbers to 4


In [45]:
# if-else statement
for i in range(5):
    if i  % 2 == 0:
        print("I can count even numbers to "+str(i))
    else:
        print("I can count odd numbers to "+str(i))

I can count even numbers to 0
I can count odd numbers to 1
I can count even numbers to 2
I can count odd numbers to 3
I can count even numbers to 4


In [46]:
# if-elif-else statements
for i in range(5):
    if i<=1:
        print("I can count to "+str(i))
    elif 2<=i<=3:
        print("I can also count to "+str(i))
    else:
        print("But I can't count to "+str(i))

I can count to 0
I can count to 1
I can also count to 2
I can also count to 3
But I can't count to 4


## Writing functions
You often have to perform the same type of task many times, on different data. To avoid writing the same code over and over, you can write functions that can be called every time you want to perform the specific task. 

In [47]:
def power_of(a, b):
    return a**b
print(power_of(2,3))

8


In [48]:
print(power_of(3,5))

243
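Functions can also document themselves and provide default values for their arguments. The variant below is a sketch rather than a replacement for the function above: the argument names `base` and `exponent` and the default of 2 are choices made for this example.

```python
def power_of(base, exponent=2):
    """Return base raised to exponent; squares the base if no exponent is given."""
    return base ** exponent

squared = power_of(3)              # uses the default exponent: 9
cubed = power_of(2, exponent=3)    # arguments can be passed by name: 8

print(squared, cubed)
```

The text between triple quotes is a docstring; `help(power_of)` will display it.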


## Reading and writing files
You can use the read, write, readlines and writelines methods of Python's built-in file objects to read and write files. 
We have the mydataset.csv file that you saved from ELE. Let's say you are interested in what regions there are in this data. Let's start by creating a set called regions, which we will populate with the values available in the data. 


In [49]:
regions=set() # create an empty set
with open("mydataset.csv", "r") as myfile:  # open the file for reading
    data = myfile.readlines() 
    for line in data:
        region=line.split(",")[1]
        regions.add(region)
print(regions)

{'North', 'South', 'East', 'North-East', 'West', 'region'}
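Note that the set above also contains the column name 'region', because the header line is read like any other line. One way to avoid this is to skip the first line before looping. The sketch below is self-contained: it first writes a small stand-in file with invented rows, so the file name and its contents are assumptions, not the real dataset.

```python
import os
import tempfile

# Create a small stand-in for the dataset (hypothetical rows)
sample = "id,region,party\n1,East,Centre\n2,South,Left\n3,East,Right\n"
path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(path, "w") as myfile:
    myfile.write(sample)

regions = set()
with open(path, "r") as myfile:
    next(myfile)                        # skip the header line
    for line in myfile:                 # a file can be looped over line by line
        fields = line.strip().split(",")  # strip() removes the trailing newline
        regions.add(fields[1])

print(sorted(regions))  # ['East', 'South']
```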


Now we can open a new file, and write the regions to it: 

In [50]:
with open("regions.txt", "w") as myfile:  # open the file for writing
    for region in regions:
        myfile.write(region + '\n' )

You can also append to a file in mode "a" and open it both for reading and writing in mode "r+". 

In [51]:
with open("regions.txt", "a") as myfile:  # open the file for appending
    for region in regions:
        myfile.write(region + '\n' )

## Importing third party modules
Everything that we've done so far was based on Python's built-in functions. However, we will often need to import modules and packages which can handle more complex or specific tasks. For example, we may want to use a module that is able to better read and write csv data, such as the 'csv' module, which ships with Python's standard library. To do that, we have to first import the module. For modules and packages that are already installed, you can simply do that by typing 'import' and the name of the package. Third-party packages such as Pandas have to be installed before they can be imported, but many of the useful ones are already installed in Anaconda. 

But how do you know which packages are installed? If you open Anaconda Prompt and type "conda list", it will list all installed packages. You can do this in any terminal/command prompt where conda is available. 
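You can also check from within Python whether a module can be imported, using `importlib.util.find_spec` from the standard library. The helper name `is_installed` and the deliberately non-existent module name below are made up for this sketch.

```python
import importlib.util

def is_installed(module_name):
    """Return True if Python can find a top-level module with this name."""
    return importlib.util.find_spec(module_name) is not None

has_csv = is_installed("csv")   # part of the standard library
has_fake = is_installed("a_module_that_does_not_exist_123")  # hypothetical name

print(has_csv, has_fake)
```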

## Reading data in different formats

The 'csv' module is part of Python's standard library, so we can go ahead and import it. Let's read the file in csv format, recode missing values as NaN, and write it out as a new clean.csv. The 'csv' module is very useful for manipulating large files that contain long text fields.  

In [52]:
import csv
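with open("clean.csv", "w", newline="") as outfile:  # open the output file for writing
    writer=csv.writer(outfile)
    with open("mydataset.csv", "r", newline="") as infile:  # open the input file for reading
        reader=csv.reader(infile)
        writer.writerow(next(reader))  # copy the header row unchanged
        for row in reader:
            # keep the first six columns and recode "missing" in the seventh as "NaN"
            writer.writerow(row[0:6]+[(row[6].replace("missing", "NaN"))])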
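If you prefer to refer to columns by name rather than by position, the same module offers `csv.DictReader`, which turns each row into a dictionary keyed by the header. The sketch below reads from an in-memory string with invented rows instead of the real file, so that it runs on its own.

```python
import csv
import io

# In-memory stand-in for a csv file (hypothetical rows)
sample = io.StringIO("id,region,reelected\n1,East,0\n2,South,missing\n")

rows = []
reader = csv.DictReader(sample)   # uses the first line as column names
for row in reader:
    if row["reelected"] == "missing":
        row["reelected"] = "NaN"  # recode the missing value
    rows.append(row)

print(rows[0]["region"], rows[1]["reelected"])  # East NaN
```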
            

You can also read csv files, as well as other file formats, using Pandas. Pandas is one of the main libraries for data analysis in Python. If you are familiar with data frames in R, you will find Pandas very easy to use. Let's see what we can do by importing the clean.csv file that you just saved. 

In [53]:
import pandas as pd # we import it as pd because it's easier to type
df=pd.read_csv("clean.csv")
df

   id region   party chamber   spent   raised  reelected
0   1   East  Centre       H  285937   411847        0.0
1   2   East  Centre       H  308530  1301546        1.0
2   3   East  Centre       H  435962   629768        4.0
3   4   East  Centre       H  685526   737446        3.0
4   5   East  Centre       H  242312   370557        1.0
5   6   East  Centre       H  149546   432485        3.0
6   7   East  Centre       H  618818   850163        2.0
7   8   East  Centre       H  354655   364555        2.0
8   9   East  Centre       H  147248   165364        0.0
9  10   East  Centre       H  306052   360675        3.0


In [54]:
df.head(5) # first 5 entries

   id region   party chamber   spent   raised  reelected
0   1   East  Centre       H  285937   411847        0.0
1   2   East  Centre       H  308530  1301546        1.0
2   3   East  Centre       H  435962   629768        4.0
3   4   East  Centre       H  685526   737446        3.0
4   5   East  Centre       H  242312   370557        1.0


In [55]:
df.columns # the column names

Index(['id', 'region', 'party', 'chamber', 'spent', 'raised', 'reelected'], dtype='object')

In [56]:
df["reelected"][0:5] # select a column, and a slice within it

0    0.0
1    1.0
2    4.0
3    3.0
4    1.0
Name: reelected, dtype: float64

In [57]:
# Subsetting data: create another data frame that only includes observations from the South and East. 
value_list=["South", "East"]
# Replace this with df[~df.region.isin(value_list)] to keep only the rows that don't meet the condition
df_SE=df[df.region.isin(value_list)]
df_SE.count()

id           216
region       216
party        216
chamber      216
spent        216
raised       216
reelected    214
dtype: int64

In [58]:
# Select only the rows that meet multiple conditions:
df_restricted=df[(df['region']=="South") & (df["chamber"]=="S") & 
                 (df["reelected"]==0)]
df_restricted.head(5)

      id region   party chamber    spent   raised  reelected
356  357  South  Centre       S   199176   436192        0.0
358  359  South  Centre       S   221402   424304        0.0
392  393  South    Left       S  1768956  4699994        0.0
394  395  South    Left       S  1972873    48947        0.0
428  429  South   Right       S   719563  3231786        0.0


In [59]:
# Group and aggregate 
grouped=df.groupby(["region", "chamber"])
aggregated=grouped.agg({"spent":['sum','mean', 'min'], 
                       'raised':['sum', 'mean', 'max']})
aggregated

                     spent                           raised
region      chamber       sum           mean    min       sum       mean      max
East        H        29446708  320072.913043      0  55481753   603062.5  4205366
East        S         7259432       453714.5      0  12413611   775850.7  3959212
North       H        21778335  259265.892857      0  39890400   474885.7  1773323
North       S        12482798  520116.583333      0  27547225  1147801.0  6263060
North-East  H        27511352  348244.962025      0  53487404   677055.7  4091159
North-East  S        13722168  490077.428571      0  43154780  1541242.0  9790929
South       H        29046467  330073.488636      0  56751507   644903.5  3020933
South       S        11140983      557049.15  40206  22286751  1114338.0  4699994
West        H        30186789  331722.956044  43175  59857901   657779.1  5169778
West        S         6797473  399851.352941      0  15904129   935537.0  4631824


## Basic visualization
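To display graphs inline in Jupyter notebooks, make sure you add "%matplotlib inline" in the first cell. For interactive figures (zoom and pan) in the classic Jupyter Notebook, you can use "%matplotlib notebook" instead. If you run both, the one that was run last takes effect. 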

In [60]:
%matplotlib inline 
%matplotlib notebook

### Histograms, comparing two distributions. 

In [61]:
import matplotlib.pyplot as plt
df_money=df[["raised","spent"]]
# DataFrame.plot creates its own figure, so a separate plt.figure() call would only add an empty one
df_money.plot.hist(stacked=True, bins=50)

[Figure: stacked histogram of the raised and spent columns, 50 bins]

### Scatterplot with linear fit line

In [62]:
import numpy as np

x=df_money.raised.values
y=df_money.spent.values
fig, ax = plt.subplots()
fit = np.polyfit(x, y, deg=1)
ax.plot(x, fit[0] * x + fit[1], color='red')
ax.scatter(x, y)


[Figure: scatterplot of spent against raised, with a red linear fit line]
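In the cell above, `np.polyfit(x, y, deg=1)` returns the coefficients of the fitted line with the highest degree first, so `fit[0]` is the slope and `fit[1]` is the intercept. To make that less of a black box, here is a sketch of the same least-squares calculation in plain Python, on a tiny made-up dataset where the true line is y = 2x + 1.

```python
# Illustrative data lying exactly on the line y = 2x + 1
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

# Least-squares slope: covariance of x and y divided by the variance of x
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
s_xx = sum((xi - mean_x) ** 2 for xi in x)
slope = s_xy / s_xx

# The fitted line passes through the point (mean_x, mean_y)
intercept = mean_y - slope * mean_x

print(slope, intercept)  # 2.0 1.0
```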

### Barplots


In [63]:
agg2=grouped.agg({"spent":"mean", 
                  "raised":"mean"})
agg2.plot.bar()

[Figure: bar plot of mean spent and mean raised by region and chamber]

## Additional resources

### Q-Step workshops

**Term 1**

7 December: **Social media data collection and analysis**

Working with the Twitter and Facebook APIs, data management, text processing and intro to text analysis, basic network analysis. 

**Term 2:** 

TBA: **Data analysis in Python** 

Covering: Overview of most common packages, descriptive statistics, statistical analysis (regression, etc.), visualization.

TBA: **Text analysis in Python**

An introduction to text analysis. 


### Other beginner resources
All very hands-on, excellent for beginners, both in Python and in programming in general. 

[The Python Tutorial](https://docs.python.org/3/tutorial/index.html)

[Learn Python the Hard Way](https://learnpythonthehardway.org/book/)

[Dive Into Python 3](http://www.diveintopython3.net/)

