# Fundamentals of Data Analysis with Python 

## Day 1: Fundamentals of Python for Researchers 

49th [GESIS Spring Seminar: Digital Behavioral Data](https://training.gesis.org/?site=pDetails&pID=0xA33E4024A2554302B3EF4AECFC3484FD)   
Cologne, Germany, March 2-6 2020

### Course Developers and Instructors 

* Dr. [John McLevey](www.johnmclevey.com), University of Waterloo (john.mclevey@uwaterloo.ca)     
* [Jillian Anderson](https://ca.linkedin.com/in/jillian-anderson-34435714a), Simon Fraser University (jillianderson8@gmail.com) 

<hr>


### Overview 

This notebook focuses on some fundamentals of Python programming with social scientists in mind. We will cover general Python basics, from manipulating simple data types like `strings` to writing your own functions. Our goal is to **lay a foundation** that we can build on in the rest of the week. 

If you are new to programming, this may seem like a lot of new information to process and skills to master. *No need to worry.* You will continue to learn and develop over the course of the week. Concepts introduced here will be reinforced later. Focus on understanding; mastery will come with time and practice. 

### Plan for the Day

1. [General introductions](#intro)    
2. [Overview of the course](#over)    
3. [Getting started with Python and Jupyter Notebooks](#starting)    
4. [Simple data types, variables, and assignment](#datatypes)    
5. [Flow control and conditionals](#conditional)    
6. [Functions](#functions)  
7. [Lists and tuples](#lts)    
8. [Dictionaries](#dicts)    
9. [Open Work Time](#open)

<hr>

# General Introductions <a id='intro'></a>

## John

* Associate Professor, University of Waterloo, Ontario, Canada 
* PhD in Sociology, 2013
    * Computational Social Science (Networks, Applied Natural Language Processing, Machine Learning)
    * Cognitive Social Science, Environmental Social Science (socio-ecological systems + environmental movements), Democracy and the Politics of Knowledge and Information 
* PI [Netlab](networkslab.org) @ the University of Waterloo
* McLevey. 2020. *Doing Computational Social Science*, London: Sage. 
* Python packages: `metaknowledge`, `nate`, `pdpp`, etc. 
* Other interests: hiking, cycling, strength training, squash, design, travel, photography  

## Jillian

* Big Data Developer, Simon Fraser University, British Columbia, Canada
* Bachelor's in Knowledge Integration and Computer Science (minor) from the University of Waterloo
* MSc in Computing Science and Big Data from Simon Fraser University 
* Extensive experience developing in Python, R, Java, C, and Javascript
* 2016-2018, Data Scientist @ [Netlab](networkslab.org)
* Machine learning for agriculture
* Other interests: ...

### Who Are You? 

* Where are you coming from?
* What are your primary research interests?
* What do you most hope to get out of the course?

# Overview of the Course <a id='over'></a>

* Day 1: **Python 101**
    * Emphasis on laying a foundation 
    * A little *general* knowledge of Python goes a very long way
* Day 2: **Getting Data From the Web**
    * Emphasis on web scraping and APIs
* Day 3: **Scientific computing / research computing 101** 
    * Emphasis on `numpy`, `pandas`, and `matplotlib`
* Day 4: **Unstructured data 101**
    * Emphasis on computing with natural language data
* Day 5: **Data manipulation**
    * Emphasis on reshaping and combining data

By the end of this week, you will have a solid foundation for working with digital behavioural data and applying methods from computational social science and social data science, such as machine learning methods and network analysis. 

## Process

* We will use the Software Carpentry post-it system 
* Please **ask questions at any time**. If you have a question, someone else probably does too. 
* We want to meet you where you are, which means we might make some minor revisions to the materials over the course of the week

# Getting Started with Python and Jupyter Notebooks <a id='starting'></a>

> "You will have to learn some basic programming concepts before you can do anything. Like a wizard-in-training, you might think these concepts seem arcane and tedious, but with some knowledge and practice, you'll be able to command your computer like a magic wand to perform incredible feats." 
>
> Al Sweigart, [*Automate the Boring Stuff with Python*](https://automatetheboringstuff.com)

If this is your first time programming, you should not expect to complete this notebook with a high level of mastery. You will likely have to move on to new content before you feel truly comfortable with this fundamental content. *That's perfectly normal*. Instead of getting stuck here, you should continue with the course material and revisit this material as needed. You will become more comfortable with it over time. 

...

## Interactive Computing & Project Jupyter

...

![](img/jupyter.png)

# Simple Data Types, Variables, and Assignment <a id='datatypes'></a>

Every value in Python has a single data type. The key data types to know are **integers** (e.g. `42`), **floats** (e.g. `42.0`), and **strings** (e.g. `"The Hitchhiker's Guide to the Galaxy"`, `'cats are the best'`, and `'The Night Manager'`). Strings can be wrapped in single or double quotes; use double quotes when the text itself contains an apostrophe.

In [1]:
2 + 2

4

In [2]:
2 * 9 

18

In [3]:
10 / 2

5.0

In [4]:
2 ** 6

64

In [5]:
2 + 9 * 7

65
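
Two details in the outputs above are easy to miss: `10 / 2` returned the float `5.0` rather than the integer `5`, and `2 + 9 * 7` returned `65` because multiplication is evaluated before addition. The short sketch below makes both explicit using the built-in `type()` function. The variable names are chosen only for illustration.

```python
# Division with / always produces a float, even when the result is a whole number.
quotient = 10 / 2
print(quotient, type(quotient))

# Floor division with // returns an integer when both operands are integers.
floor_quotient = 10 // 2
print(floor_quotient, type(floor_quotient))

# Multiplication happens before addition unless parentheses say otherwise.
without_parens = 2 + 9 * 7
with_parens = (2 + 9) * 7
print(without_parens, with_parens)
```

When in doubt about the order of operations, adding parentheses costs nothing and makes the intent clear to readers of your code.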

We can store data in 'variables' by 'assignment', indicated by the `=` operator. We can call variables anything we want, provided (1) we only use one word; (2) we only use numbers, letters, and the underscore character (`_`); (3) we don't start the name with a number; and (4) we do not use any special words that are reserved for Python itself (e.g. `class`). My advice is to use descriptive names for your variables (e.g. if the variable stores a string of your last name, call it `last_name`, not `ln`). 

In [6]:
a_number = 16
print(a_number)

16


In [11]:
a_number * a_number

256

In [14]:
city = 'Cologne'
country = 'Germany'

print(city)
print(country)
print(city, country)

Cologne
Germany
Cologne Germany
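
If you are unsure whether a name follows the four naming rules described above, Python can check for you. This is a small sketch using the string method `isidentifier()` and the standard library's `keyword` module; the example names are made up for illustration. Note that `isidentifier()` is slightly more permissive than the rules above, since it also accepts non-English letters.

```python
import keyword

# str.isidentifier() covers rules (1) to (3): one word, allowed characters, no leading number.
print('last_name'.isidentifier())   # a valid name
print('last name'.isidentifier())   # two words, so not valid
print('2nd_city'.isidentifier())    # starts with a number, so not valid

# keyword.iskeyword() covers rule (4): reserved words cannot be used as variable names.
print(keyword.iskeyword('class'))   # reserved by Python
print(keyword.iskeyword('city'))    # not reserved
```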


### <font color="tomato">YOUR TURN!</font> <a id='yt1'></a>

Integers, floating points, and strings are three basic data types in Python. In the cell below, (1) assign your instituional affiliation to a variable called `affiliation`, (2) assign your first name to a variable called `first_name`, and (3) assign your last name to a variable called `last_name`. 

In [13]:
# Your Answer Here 

As you know from the previous examples, the `+` operator will add two numbers together if the data types are integers or floats. However, if the data types are strings, `+` will perform string concatenation. 

In [15]:
city + country

'CologneGermany'

In [17]:
city + ', ' + country

'Cologne, Germany'

In [19]:
print(city + ' is the fourth-most populous city in ' + country)

Cologne is the fourth-most populous city in Germany


In [20]:
print("{} is the fourth-most populous city in {}.".format(city, country))

Cologne is the fourth-most populous city in Germany.


In [25]:
print("GESIS (at The Leibniz Institute for the Social Sciences) is in {0}, {1}. {0} is the fourth-most populous city in {1}.".format(city, country))

GESIS (at The Leibniz Institute for the Social Sciences) is in Cologne, Germany. Cologne is the fourth-most populous city in Germany.
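
The `.format()` examples above can also be written with f-strings, which are available in Python 3.6 and later. This is offered as an alternative style rather than a replacement; both approaches produce the same strings. The variable names `sentence` and `length_note` are introduced here only for illustration.

```python
city = 'Cologne'
country = 'Germany'

# An f-string evaluates the expressions inside {} and inserts the results into the string.
sentence = f"{city} is the fourth-most populous city in {country}."
print(sentence)

# Any expression can go inside the braces, including function calls.
length_note = f"{city} has {len(city)} letters."
print(length_note)
```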


In [35]:
len(country)

7

In [31]:
print(city, len(city))

Cologne 7


In [32]:
print(country, len(country))

Germany 7


If we mix data types in an expression using the `+` operator, Python will raise a `TypeError`, because it can't add a number to a string, and it can't concatenate a string with a number. 

In [36]:
city + 42

TypeError: can only concatenate str (not "int") to str

In [37]:
city * 3

'CologneCologneCologne'
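
The last cell above works because `*` has a defined meaning for a string and an integer: it repeats the string. To combine a string and a number with `+`, one option is to convert one of them first. The sketch below uses the built-in `str()` and `int()` functions; the variable names are illustrative only.

```python
city = 'Cologne'

# str() converts a number to a string so that + can concatenate it.
label = city + ' ' + str(42)
print(label)

# int() converts a string of digits into an integer so that + can add it.
total = int('42') + 8
print(total)
```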

There are quite a lot of things we can do to strings in Python. If you want to learn a bit more about string manipulation right now, you can consult [Chapter 6: Manipulating Strings](https://automatetheboringstuff.com/chapter6/) from *Automate the Boring Stuff*. Alternatively, you can wait. We will introduce various methods for string manipulation throughout the course. 

# Flow Control and Conditionals <a id='conditional'></a>

We have already seen how to tell our computer to execute individual instructions, such as evaluate the expression `2 + 2`. Most of the time, however, we don't want our computer to simply execute a series of individual instructions one after the other. Instead, we want to be able to tell our computer to execute instructions *depending on some condition*. This is 'flow control'. 

Flow control statements usually include a 'condition' and a 'clause', contained within a 'block'. 

* <font color="tomato">condition</font>: ... 
* <font color="tomato">clause</font>: ... 
* <font color="tomato">block</font>: ...

The cell below executes a simple `if` control flow statement. First, it prompts you to enter the name of this course into a box. Then it executes the statement. In the cell below the code block, translate this control flow statement into plain English. 

In [40]:
print('Please type the name of this course into the box below. ')
course = input()

if course == 'Fundamentals of Data Analysis with Python':
    print('Welcome to the course!')
elif course == 'A Practical Introduction to Machine Learning in Python':
    print('You are in the wrong course! That course starts next week.')
elif course == 'Social Network Analysis with Digital Behavioural Data':
    print('You are in the wrong course! That course starts in two weeks.')
else:
    print('🙄 🤔')

Please type the name of this course into the box below. 


 Fundamentals


🙄 🤔
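
In the run shown above, typing only `Fundamentals` fell through to the `else` branch, because `==` is true only when the two strings match exactly. The following sketch, which avoids `input()` so that it runs without prompting, shows how the comparisons evaluate on their own. The names `exact_match`, `partial_match`, and `message` are introduced here for illustration.

```python
course = 'Fundamentals'

# A comparison with == evaluates to True or False.
exact_match = course == 'Fundamentals of Data Analysis with Python'
print(exact_match)

# The `in` operator checks whether one string appears inside another.
partial_match = course in 'Fundamentals of Data Analysis with Python'
print(partial_match)

# Comparisons can be combined with `and`, `or`, and `not`.
if partial_match and not exact_match:
    message = 'Close, but please type the full course name.'
else:
    message = 'No hint needed.'
print(message)
```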


### <font color="tomato">YOUR TURN!</font> <a id='yt4'></a>

The cell below executes a simple `while` control flow statement. In the cell below the code block, translate this control flow statement into plain English. 

In [43]:
day = 1

while day <= 5:
    print("It's Day {}. The course is still in progress.".format(day))
    day = day + 1

print('\nThe course is complete. Congratulations!')

It's Day 1. The course is still in progress.
It's Day 2. The course is still in progress.
It's Day 3. The course is still in progress.
It's Day 4. The course is still in progress.
It's Day 5. The course is still in progress.

The course is complete. Congratulations!


In [44]:
# Your Answer Here 
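
For comparison, the same output as the `while` loop above can be produced with a `for` loop over `range()`. This is a sketch of an equivalent form, not a preferred one: a `while` loop needs you to update the counter yourself, whereas `for` with `range()` handles the counting.

```python
# range(1, 6) produces the numbers 1, 2, 3, 4, 5 (the stop value 6 is excluded).
for day in range(1, 6):
    print("It's Day {}. The course is still in progress.".format(day))

print('\nThe course is complete. Congratulations!')
```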

# Functions <a id='functions'></a>

So far we have used a few functions that are built in to Python, such as `print()` and `len()`. We can also write our own functions, which let us execute small chunks of code. Writing our own functions is a very powerful way of compartmentalizing and organizing our code. 

User-defined functions can be as simple or as complex as we like (although you should strive to design functions that are *as simple as possible*). For example, the following cell defines a function called `welcome()`, which takes a name as an argument and prints a message welcoming that person to the course. 

In [45]:
def welcome(name):
    print('Hello, {}. Welcome to Fundamentals of Data Analysis with Python!'.format(name))

In [46]:
welcome('Miyoko')

Hello, Miyoko. Welcome to Fundamentals of Data Analysis with Python!


* reasons to write functions 
* local vs. global scope
* conventions to follow 
* pseudocode
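
To illustrate the "local vs. global scope" item in the list above, here is a minimal sketch. The function `rename_course()` is hypothetical and exists only to show that assigning to a name inside a function creates a separate, local variable.

```python
course = 'Fundamentals of Data Analysis with Python'  # a global variable

def rename_course():
    # This assignment creates a new local variable that happens to share the name `course`.
    # It exists only while the function runs and does not change the global variable.
    course = 'A different course'
    return course

local_result = rename_course()
print(local_result)
print(course)  # the global variable is unchanged
```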

### <font color="tomato">YOUR TURN!</font> <a id='yt8'></a>

In the cell below the code block, translate the function `welcome_message()` into plain English. How does the returned string change depending on the values passed into the function?

In [54]:
def welcome_message(name, course):
    if course == 'fundamentals':
        msg = 'Hello {}, welcome to Fundamentals of Data Analysis with Python.'.format(name)
    elif course == 'machine_learning':
        msg = 'Hello {}, welcome to A Practical Introduction to Machine Learning in Python.'.format(name)
    elif course == 'networks':
        msg = 'Hello {}, welcome to Social Network Analysis with Digital Behavioral Data.'.format(name)
    else:
        msg = 'Sorry, I think you might be in the wrong course.'
    
    return msg

In [58]:
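# Store the result under a new name so the welcome() function defined earlier is not overwritten.
greeting = welcome_message('Karamo', 'fundamentals')
print(greeting)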

Hello Karamo, welcome to Fundamentals of Data Analysis with Python.
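
The two functions in this section differ in an important way: `welcome()` prints its message, while `welcome_message()` returns it. The sketch below uses simplified versions of both (shortened here for illustration, not the definitions above) to show what the caller receives in each case.

```python
def welcome(name):
    # Prints a message. With no return statement, the function returns None.
    print('Hello, {}.'.format(name))

def welcome_message(name):
    # Builds the message and hands it back to the caller without printing it.
    return 'Hello, {}.'.format(name)

printed = welcome('Miyoko')           # the message appears on screen
returned = welcome_message('Karamo')  # nothing appears on screen

print(printed)            # None, because welcome() returned nothing
print(returned.upper())   # the returned string can be reused or transformed
```

Returning a value is usually more flexible, because the caller decides whether to print it, store it, or pass it to another function.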


# Lists, Tuples, and Sets <a id='lts'></a>

This section covers slightly more complex data structures: `lists`, `tuples`, and `sets`. Let's start with `lists`. 

## Lists

`lists` can contain multiple values in an ordered sequence, such as `['fundamentals', 'of', 'data', 'analysis', 'with', 'python']`, `['GESIS', 'Leibniz Institute for Social Sciences']`, or `['42', '77', 'mix', 'data', 'types', 42, 77]`. 

In [73]:
course = ['fundamentals', 'of', 'data', 'analysis', 'with', 'python']

In [74]:
course

['fundamentals', 'of', 'data', 'analysis', 'with', 'python']

In [75]:
course_joined = " ".join(course)
course_joined

'fundamentals of data analysis with python'

In [78]:
course_split = course_joined.split(" ")
course_split

['fundamentals', 'of', 'data', 'analysis', 'with', 'python']

In [79]:
course == course_split

True

In [80]:
print('There are {} words in the course title.'.format(len(course)))

There are 6 words in the course title.


In [81]:
for word in course:
    print(word)

fundamentals
of
data
analysis
with
python


In [82]:
course[0]

'fundamentals'

In [83]:
course[1]

'of'

In [84]:
course[2]

'data'

In [85]:
course[3]

'analysis'

In [86]:
course[4]

'with'

In [87]:
course[5]

'python'

In [88]:
course[-1]

'python'

In [89]:
course[-2]

'with'

In [90]:
for i in range(len(course)):
    print(course[i])

fundamentals
of
data
analysis
with
python


In [91]:
course.append('John McLevey')
course.append('Jillian Anderson')
course.append('GESIS')

In [92]:
course

['fundamentals',
 'of',
 'data',
 'analysis',
 'with',
 'python',
 'John McLevey',
 'Jillian Anderson',
 'GESIS']

In [93]:
course[6:]

['John McLevey', 'Jillian Anderson', 'GESIS']

In [94]:
course[-2]

'Jillian Anderson'

In [95]:
course.index('GESIS')

8
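
The `course[6:]` cell above is an example of slicing, which selects a range of items rather than a single one. Here is a short sketch of a few more slice forms, using the original six-word list; the variable names are chosen for illustration.

```python
course = ['fundamentals', 'of', 'data', 'analysis', 'with', 'python']

first_two = course[:2]   # items at index 0 and 1
middle = course[2:4]     # items at index 2 and 3 (the stop index is excluded)
last_two = course[-2:]   # the final two items

print(first_two)
print(middle)
print(last_two)

# Slicing returns a new list; the original list still has all six items.
print(len(course))
```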

### <font color="tomato">YOUR TURN!</font> <a id='yt9'></a>

In the cell below, change the value of `GESIS` to `Leibniz Institute for Social Sciences` in the list `course` using list indices. 

In [96]:
# Your Answer Here

Let's see if it worked. 

In [97]:
if 'Leibniz Institute for Social Sciences' in course:
    print('Changed! Well done.')
else:
    print('Not yet. Try again!')

Not yet. Try again!


In [98]:
countries = ['Canada', 'Germany']
countries

['Canada', 'Germany']

It is also possible to store lists inside of lists.  

In [99]:
who = [countries, ['University of Waterloo', 'Leibniz Institute']]
who

[['Canada', 'Germany'], ['University of Waterloo', 'Leibniz Institute']]

In [100]:
for each in who:
    print(each)

['Canada', 'Germany']
['University of Waterloo', 'Leibniz Institute']


In [101]:
for each in who:
    for value in each:
        print(value)
    print('\n')

Canada
Germany


University of Waterloo
Leibniz Institute
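
Besides looping, individual values in a nested list can be reached by chaining indices. A minimal sketch, reusing the `who` list from above with illustrative variable names:

```python
countries = ['Canada', 'Germany']
who = [countries, ['University of Waterloo', 'Leibniz Institute']]

# The first index selects an inner list; the second selects an item within it.
second_country = who[0][1]
first_institution = who[1][0]

print(second_country)
print(first_institution)
```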




## Tuples

`tuples` are very similar to `lists` except that they are 'immutable' and values are stored in between `()` rather than `[]`. 

* <font color="tomato">mutable</font>: ...
* <font color="tomato">immutable</font>: ...

In other words, we can modify the values of a `list` because `lists` are mutable. We can't modify the values of a `tuple`, because `tuples` are immutable. One advantage of using tuples is ... 

In [102]:
course_list = course.copy()
course_list

['fundamentals',
 'of',
 'data',
 'analysis',
 'with',
 'python',
 'John McLevey',
 'Jillian Anderson',
 'GESIS']

In [103]:
course_tuple = tuple(course)
course_tuple

('fundamentals',
 'of',
 'data',
 'analysis',
 'with',
 'python',
 'John McLevey',
 'Jillian Anderson',
 'GESIS')

In [104]:
course_list.sort()
course_list

['GESIS',
 'Jillian Anderson',
 'John McLevey',
 'analysis',
 'data',
 'fundamentals',
 'of',
 'python',
 'with']

In [105]:
course_list.sort(reverse=True)
course_list

['with',
 'python',
 'of',
 'fundamentals',
 'data',
 'analysis',
 'John McLevey',
 'Jillian Anderson',
 'GESIS']

In [106]:
course_tuple.sort()
course_tuple

AttributeError: 'tuple' object has no attribute 'sort'
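
The error above arises because `.sort()` modifies a list in place, which a tuple cannot allow. The sketch below shows two related behaviours: the built-in `sorted()` function returns a new list and leaves the tuple untouched, and assigning to a tuple index raises a `TypeError`. It uses `try`/`except`, which has not been introduced in this notebook, only so that the example runs to completion.

```python
course_tuple = ('fundamentals', 'of', 'data', 'analysis', 'with', 'python')

# sorted() accepts any sequence and returns a new list; the tuple itself is unchanged.
sorted_words = sorted(course_tuple)
print(sorted_words)
print(course_tuple[0])

# Assigning to an index fails, because tuples are immutable.
try:
    course_tuple[0] = 'basics'
except TypeError as error:
    error_message = str(error)
    print(error_message)
```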

# Dictionaries <a id='dicts'></a>

`Dictionaries` are a flexible way to organize data. `Dictionaries` contain `key`-`value` pairs. For example, the following `dictionary` contains data on Simon Fraser University. The `keys` are 'name', 'province', 'country', 'cities', and '2015enrollments'. The `values` are three strings (the name, province, and country), a list of cities, and an integer enrollment count. 

In [139]:
uni1 = {
    'name': 'Simon Fraser University',
    'province': 'British Columbia',
    'country': 'Canada',
    'cities': ['Burnaby', 'Surrey', 'Vancouver'],
    '2015enrollments': 34990
}

In [140]:
uni1

{'name': 'Simon Fraser University',
 'province': 'British Columbia',
 'country': 'Canada',
 'cities': ['Burnaby', 'Surrey', 'Vancouver'],
 '2015enrollments': 34990}

In [145]:
uni1['cities']

['Burnaby', 'Surrey', 'Vancouver']

In [146]:
uni1['2015enrollments']

34990

In [147]:
for city in uni1['cities']:
    print(city)

Burnaby
Surrey
Vancouver


In [148]:
# Looping over a dictionary directly yields its keys.
for k in uni1:
    print(k)

name
province
country
cities
2015enrollments


In [149]:
for k, v in uni1.items():
    print(v)

Simon Fraser University
British Columbia
Canada
['Burnaby', 'Surrey', 'Vancouver']
34990


In [150]:
for each in uni1.values():
    print(each)

Simon Fraser University
British Columbia
Canada
['Burnaby', 'Surrey', 'Vancouver']
34990
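
Dictionaries are mutable, so we can add and change key-value pairs after creating them. The sketch below uses a shortened copy of `uni1`; the `'founded'` key is hypothetical and deliberately missing, to show how `.get()` behaves when a key does not exist.

```python
uni1 = {
    'name': 'Simon Fraser University',
    'province': 'British Columbia',
    'cities': ['Burnaby', 'Surrey', 'Vancouver'],
}

# Assigning to a new key adds a key-value pair.
uni1['country'] = 'Canada'

# Assigning to an existing key replaces its value.
uni1['name'] = 'SFU'

# .get() returns a default value instead of raising a KeyError when a key is missing.
founded = uni1.get('founded', 'unknown')

# The `in` operator checks the keys of a dictionary, not its values.
has_province = 'province' in uni1

print(uni1['country'], uni1['name'], founded, has_province)
```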


As you can see, we can nest data structures inside one another: `lists` inside `dictionaries`, `lists` inside other `lists`, `dictionaries` inside `lists`, and even `dictionaries` inside other `dictionaries`. Combining data structures this way enables us to model the real world in powerful and flexible ways. 

In [151]:
uni2 = {
    'name': 'University of Waterloo',
    'province': 'Ontario',
    'country': 'Canada',
    'cities': ['Waterloo', 'Kitchener', 'Stratford', 'Cambridge'],
    '2015enrollments': 36670
}

In [152]:
universities = [uni1, uni2]

In [153]:
universities

[{'name': 'Simon Fraser University',
  'province': 'British Columbia',
  'country': 'Canada',
  'cities': ['Burnaby', 'Surrey', 'Vancouver'],
  '2015enrollments': 34990},
 {'name': 'University of Waterloo',
  'province': 'Ontario',
  'country': 'Canada',
  'cities': ['Waterloo', 'Kitchener', 'Stratford', 'Cambridge'],
  '2015enrollments': 36670}]

In [154]:
for uni in universities:
    print(uni['name'], uni['province'], uni['country'])

Simon Fraser University British Columbia Canada
University of Waterloo Ontario Canada
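
Once records are stored as a list of dictionaries, a loop can summarize them. As a small sketch, the code below adds up the enrollment figures and counts the cities for each university, using shortened copies of the two dictionaries above. The variable name `total_enrollments` is introduced here for illustration.

```python
universities = [
    {'name': 'Simon Fraser University',
     'cities': ['Burnaby', 'Surrey', 'Vancouver'],
     '2015enrollments': 34990},
    {'name': 'University of Waterloo',
     'cities': ['Waterloo', 'Kitchener', 'Stratford', 'Cambridge'],
     '2015enrollments': 36670},
]

total_enrollments = 0
for uni in universities:
    # Add this university's enrollment to the running total.
    total_enrollments = total_enrollments + uni['2015enrollments']
    # len() on the nested list counts the cities for this university.
    print(uni['name'], 'is located in', len(uni['cities']), 'cities')

print(total_enrollments)
```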


### <font color="tomato">YOUR TURN!</font> <a id='yt12'></a>

In the cell below, (1) create a dictionary called `me` with the following `keys`: 'first_name', 'last_name', 'discipline' (e.g. political science), 'research_interests',  and 'programming_experience_level'. Enter the correct values. Once you have created this dictionary, (2) print each value to the screen using a for loop. 

In [155]:
# Your Answer Here 

# Open Work Time <a id='open'></a>