# FUNDAMENTALS OF DATA ANALYSIS WITH PYTHON <br><font color="crimson">DAY 1: FUNDAMENTALS OF PYTHON FOR RESEARCHERS </font>

49th [GESIS Spring Seminar: Digital Behavioral Data](https://training.gesis.org/?site=pDetails&pID=0xA33E4024A2554302B3EF4AECFC3484FD)   
Cologne, Germany, March 2-6 2010

### Course Developers and Instructors 

* Dr. [John McLevey](www.johnmclevey.com), University of Waterloo (john.mclevey@uwaterloo.ca)     
* [Jillian Anderson](https://ca.linkedin.com/in/jillian-anderson-34435714a?challengeId=AQGaFXECVnyVqAAAAW_TLnwJ9VHAlBfinArnfKV6DqlEBpTIolp6O2Bau4MmjzZNgXlHqEIpS5piD4nNjEy0wsqNo-aZGkj57A&submissionId=16582ced-1f90-ec15-cddf-eb876f4fe004), Simon Fraser University (jillianderson8@gmail.com) 

<hr>

## <i class="fa fa-book"></i> A NOTE ON THE SYLLABUS AND READINGS 

There is a syllabus posted on the [learning system](https://ilias.gesis.org/). We plan to cover everything on the styllabus, but we made some last minute changes to the sequence topics are presented in. This is to ensure to necessary content is presented in the best order for learning, and to get you working with data as quickly as possible. 

There are a number of readings posted on the learning system. We have provided these as supplementary resources. We encourage you to read some of them, but it is possible to understand all of the material in these notebooks without them. If you are pressed for time, it is better to focus on understanding what is going on in these notebooks. 

## <i class="fa fa-tasks"></i> OVERVIEW 

This notebook focuses on some fundamentals of Python programming with social scientists in mind. We will cover general Python basics, from manipulating simple data types like `strings` to writing your own functions. Our goal is to **lay a foundation** that we can build on in the rest of the week. 

If you are new to programming, this may seem like a lot of new information to process and skills to master. *No need to worry.* You will continue to learn and develop over the course of the week. Concepts introduced here will be reinforced later. Focus on understanding; mastery will come with time and practice. 

## <i class="fa fa-map-o"></i> PLAN FOR THE DAY

<i class="fa fa-location-arrow"></i> [General introductions](#intro)     
<i class="fa fa-location-arrow"></i> [Overview of the course](#overview)      
<i class="fa fa-location-arrow"></i> [Getting started with Python and Jupyter Notebooks](#starting)       
<i class="fa fa-location-arrow"></i> [Simple data types, variables, and assignment](#datatypes)      
<i class="fa fa-location-arrow"></i> [Functions](#functions)  
<i class="fa fa-location-arrow"></i> [Flow control and conditionals](#conditional)    
<i class="fa fa-location-arrow"></i> [Lists and tuples](#lts)    
<i class="fa fa-location-arrow"></i> [Dictionaries](#dicts)    
<i class="fa fa-location-arrow"></i> [Open Work Time](#open)

<hr>

# <i class="fa fa-users"></i> General Introductions <a id='intro'></a>

## <i class="fa fa-user"></i> John

* Associate Professor, University of Waterloo, Ontario, Canada 
* PhD in Sociology, 2013
* Research Interests:
    * Computational Social Science (Network Analysis, Applied Natural Language Processing, Machine Learning)
    * Cognitive Social Science, Environmental Movements and Governance, Democracy and the Politics of Knowledge and Information 
* PI [Netlab](networkslab.org) @ the University of Waterloo
* McLevey. 2020. *Doing Computational Social Science*, London: Sage. 
* Python packages: `metaknowledge`, `nate`, `pdpp`, etc. 
    * <i class="fa fa-github"></i> [github.com/UWNETLAB](https://github.com/UWNETLAB) 
* Other interests: 
    * Hiking, cycling, strength training, squash, design, travel, photography  

## <i class="fa fa-user"></i> Jillian

* Big Data Developer, Simon Fraser University, British Columbia, Canada
* Bachelor's in Knowledge Integration and Computer Science (minor) from the University of Waterloo
* MSc in Computing Science and Big Data from Simon Fraser University 
* Extensive experience developing in Python, R, Java, C, and Javascript
* 2016-2018, Data Scientist @ [Netlab](networkslab.org)
* Development Interests: 
    * high performance computing, data science for agriculture,  and the social and ethical implications of data science work.
* Other interests: 
    * Squash, backcountry camping, cycling, cooking, & skiing. 

## <i class="fa fa-user"></i> Who Are You? 

* Where you are coming from?
* What are your primary research interests?
* What do you most hope to get out of the course?
* Do you have programming experience in Python, R, or other languages?

## <i class="fa fa-tasks"></i> OVERVIEW OF THE COURSE<a id='over'></a>

* Day 1: **Python 101**
    * Emphasis on laying a foundation 
    * A little *general* knowledge of Python goes a very long way
* Day 2: **Getting Data From the Web**
    * Web scraping 
    * Application Programming Interfaces (APIs)
* Day 3: **Scientific computing / research computing 101** 
    * Emphasis on `pandas`
* Day 4: **Data Visualization**
    * Emphasis on `matplotlib` and `seaborn`
* Day 5: **Open Day**
    * Data Analysis Challenges 
    * Questions and Review 
    * 1:1 Consultations on Research Projects 
    * *If there is interest*: A brief introduction to working with unstructured text 

By the end of this week, you will have a solid foundation for working with digital behavioural data and applying methods from computational social science and social data science, such as machine learning methods and network analysis. 

## Process

<i class="fa fa-question-circle"></i> Please **ask questions at any time**. If you have a question, someone else probably does too. 

<i class="fa fa-refresh"></i> We want to meet you where you are, which means we might make some minor revisions to the materials over the course of the week. If we make changes to these notebooks, we will provide you with the updated versions at the start of the day. 

# <i class="fa fa-location-arrow"></i> GETTING STARTED WITH PYTHON & JUPYTER  <a id='starting'></a>

> "You will have to learn some basic programming concepts before you can do anything. Like a wizard-in-training, you might think these concepts seem arcane and tedious, but with some knowledge and practice, you'll be able to command your computer like a magic wand to perform incredible feats." 
>
> Al Sweigart, [*Automate the Boring Stuff with Python*](https://automatetheboringstuff.com)

If this is the first time programming, you should not expect to complete this notebook with a high level of mastery. You will likely have to move on to new content before you feel truly comfortable with this fundamental content. *That's perfectly normal*. Instead of getting stuck here, you should continue with the course material and revisit this material as needed. You will become more comfortable with it over time. 

## Interactive Computing & Project Jupyter

![](img/jupyter.png)

As described on the syllabus, you should already have Anaconda Python 3.7 installed on your laptop. If so, you will also have Jupyter Installed. We will take a bit of time to make sure everyone has the software installed properly. 

## Installation 

1. Download the [Python 3.7 Anaconda Distribution](https://www.anaconda.com/distribution/) for your operating system 
2. Install Anaconda. 
3. Make yourself some tea or coffee ☕️

Many of the additional packages we will use in this course will be installed with the Anaconda distribution. Others can be installed quickly as we encounter them in the course materials. The easiest way to do this is by executing a system command from your Jupyter notebook itself. For example, the code block below uses `pip` to install a package called `seaborn`. 

In [None]:
!pip install seaborn

Once you have installed Anaconda Python 3.7, we can start programming!

# <i class="fa fa-location-arrow"></i> SIMPLE DATA TYPES, VARIABLES, AND ASSIGNMENT <a id='datatypes'></a>

Every value in Python has a single data type. The key data types to know are **integers** (e.g. `42`), **floats** (e.g. `42.0`), and **strings** (e.g. `'The Hitchhiker's Guide to the Galaxy'`, `'cats are the best'`, and `'The Night Manager'`).

In [1]:
2 + 2

4

In [2]:
2 * 9 

18

In [3]:
10 / 2

5.0

In [4]:
2 ** 6

64

In [5]:
2 + 9 * 7

65

We can store data in 'variables' by 'assignment', indicated by the `=` operator. We can call variables anything we want, provided (1) we only use one word; (2) we only use numbers, letters, and the underscore character (`_`); (3) we don't start the name with a number; and (4) we do not use any special words that are reserved for Python itself (e.g. `class`). My advice is to use descriptive names for your variables (e.g. if the variable stores a string of your last name, call it `last_name`, not `ln`). 

In [6]:
a_number = 16
print(a_number)

16


In [7]:
a_number * a_number

256

In [8]:
city = 'Cologne'
country = 'Germany'

print(city)
print(country)
print(city, country)

Cologne
Germany
Cologne Germany


## <font color="crimson"><i class="fa fa-user"></i> YOUR TURN!</font>

Integers, floating points, and strings are three basic data types in Python. In the cell below, (1) assign your instituional affiliation to a variable called `affiliation`, (2) assign your first name to a variable called `first_name`, and (3) assign your last name to a variable called `last_name`. 

In [9]:
# Your Answer Here 





As you know from the previous examples, the `+` operator will add two numbers together if the data types are integers or floats. However, if the data types are strings, `+` will perform string concatenation. 

In [10]:
city + country

'CologneGermany'

In [11]:
city + ', ' + country

'Cologne, Germany'

In [12]:
print(city + ' is the fourth-most populous city in ' + country)

Cologne is the fourth-most populous city in Germany


In [13]:
print("{} is the fourth-most populous city in {}.".format(city, country))

Cologne is the fourth-most populous city in Germany.


In [14]:
print("GESIS (at The Leibniz Institute for the Social Sciences) is in {0}, {1}. {0} is the fourth-most populous city in {1}.".format(city, country))

GESIS (at The Leibniz Institute for the Social Sciences) is in Cologne, Germany. Cologne is the fourth-most populous city in Germany.


In [15]:
len(country)

7

In [16]:
print(city, len(city))

Cologne 7


In [17]:
print(country, len(country))

Germany 7


If we mix data types in an expression using the `+` operator, Python will throw an error, because it can't add an integer or float and a string, and it can't concatenate a string and an integer or float. 

In [18]:
city + 42

TypeError: can only concatenate str (not "int") to str

In [19]:
city * 3

'CologneCologneCologne'

There are quite a lot of things we can do to strings in Python. If you want to learn a bit more about string manipulation right now, you can consult [Chapter 6: Manipulating Strings](https://automatetheboringstuff.com/chapter6/) from *Automate the Boring Stuff*. Alternatively, you can wait. We will introduce various methods for string manipulation throughout the course. 

# <i class="fa fa-location-arrow"></i> FLOW CONTROL AND CONDITIONALS <a id='conditional'></a>

We have already seen how to tell our computer to execute individual instructions, such as evaluate the expression `2 + 2`. Most of the time, however, we don't want our computer to simply execute a series of individual instructions one after the other. Instead, we want to be able to tell our computer to execute instructions *depending on some condition*. This is 'flow control'. 

Flow control statements usually include a 'condition' (which evaluates to a Boolean value: True or False) and are followed by a 'clause' which is an indented block of code to execute depending on the Boolean value of the condition.

Let's make this less abstract with a simple example. 

The cell below executes a simple `if` control flow statement. First, it prompts you to enter the name of this course into a box. Then it executes the statement. In the cell below the code block, translate this control flow statement into plain English. 

In [20]:
print('Please type the name of this course into the box below. ')
course = input()

if course == 'Fundamentals of Data Analysis with Python':
    print('Welcome to the course!')
elif course == 'A Practical Introduction to Machine Learning in Python':
    print('You are in the wrong course! That course starts next week.')
elif course == 'Social Network Analysis with Digital Behavioural Data':
    print('You are in the wrong course! That course starts in two weeks.')
else:
    print('🙄 🤔')

Please type the name of this course into the box below. 
Fundamentals of Data Analysis with Python
Welcome to the course!


## <font color="crimson"><i class="fa fa-user"></i> YOUR TURN!</font> 

The cell below executes a simple `while` control flow statement. In the cell below the code block, translate this control flow statement into English (or another language you are comfortable with). 

In [21]:
day = 1

while day <= 5:
    print("It's Day {}. The course is still in progress.".format(day))
    day = day + 1

print('\nThe course is complete. Congratulations!')

It's Day 1. The course is still in progress.
It's Day 2. The course is still in progress.
It's Day 3. The course is still in progress.
It's Day 4. The course is still in progress.
It's Day 5. The course is still in progress.

The course is complete. Congratulations!


In [None]:
# Your Answer Here 





# <i class="fa fa-location-arrow"></i> FUNCTIONS <a id='functions'></a>

So far we have used a few functions that are built in to Python, such as `print()` and `len()`. We can also write our own functions, which let us execute small chunks of code. Writing our own functions is a very powerful way of compartmentalizing and organizing our code. 

User-defined functions can be as simple or as complex as we like (although you should strive to design functions that are **as simple as possible**). For example, the following cell defines a function called `welcome()`, which accepts a name and extends a welcome greeting. 

In [24]:
def welcome(name):
    print('Hello, {}. Welcome to Fundamentals of Data Analysis with Python!'.format(name))

In [25]:
welcome('Miyoko')

Hello, Miyoko. Welcome to Fundamentals of Data Analysis with Python!


We will introduce more sophisticated functions throughout the course. For now, we want to make just a few quick points about developing functions. 

First, you should always start by thinking carefully about what problem you are trying to solve. In the process, you may realized that what you thought was one problem is actually several related small problems. As you think things through, decompose the big problems into smaller problems, and think about how those problems relate to one another. 

Second, it is usually a good idea to start by writing "[pseudocode](https://en.wikipedia.org/wiki/Pseudocode)", which is a bit like starting a writing project (e.g. a journal article) with an outline. Doing this will carefully and thoughtfully will result in much better code, and frankly better data analysis. 

Third, in a data collection, cleaning, and analysis context, you will likely be doing a fair amount of exploratory work at the start of a new project. While I don't presume to speak for anyone else, I find it helpful to be doing this open ended exploration **at the same time** as I think through the problems I need to solve in a data analysis. I usually do this in a Jupyter Notebook (like this one) while I write notes in a text file or a physical notebook. I develop functions and refactor my code later, when I have a clear sense of what I want to do and how it should to be done. To continue the writing analogy, this is a bit like figuring out what your idea or argument is by doing free writing, developing an outline as your thinking evolves. Then, at some point, you open a new file and start writing cleaner more structured text. 

## <font color="crimson"><i class="fa fa-user"></i> YOUR TURN!</font>

In the cell below the code block, translate the function `welcome_message()` into English or another language you speak. How does the returned string change depending on the values passed into the function?

In [26]:
def welcome_message(name, course):
    if course == 'fundamentals':
        msg = 'Hello {}, welcome to Fundamentals of Data Analysis with Python.'.format(name)
    elif course == 'machine_learning':
        msg = 'Hello {}, welcome to A Practical Introduction to Machine Learning in Python.'.format(name)
    elif course == 'networks':
        msg = 'Hello {}, welcome to Social Network Analysis with Digital Behavioral Data.'.format(name)
    else:
        msg = 'Sorry, I think you might be in the wrong course.'
    
    return msg

In [27]:
welcome = welcome_message('Karamo', 'fundamentals')
print(welcome)

Hello Karamo, welcome to Fundamentals of Data Analysis with Python.


# <i class="fa fa-location-arrow"></i> LISTS, TUPLES, AND DICTIONARIES <a id='lts'></a>

This section covers slightly more complex data structures: `lists`, `tuples`, and `sets`. Let's start with `lists`. 

## Lists

`lists` can contain multiple values in an ordered sequence, such as `['fundamentals', 'of', 'data', 'analysis', 'with', 'python']`, `['GESIS', 'Leibniz Institute for Social Sciences']`, or `['42', '77', 'mix', 'data', 'types', 42, 77]`. 

In [28]:
course = ['fundamentals', 'of', 'data', 'analysis', 'with', 'python']

In [29]:
course

['fundamentals', 'of', 'data', 'analysis', 'with', 'python']

In [30]:
course_joined = " ".join(course)
course_joined

'fundamentals of data analysis with python'

In [31]:
course_split = course_joined.split(" ")
course_split

['fundamentals', 'of', 'data', 'analysis', 'with', 'python']

In [32]:
course == course_split

True

In [33]:
print('There are {} words in the course title.'.format(len(course)))

There are 6 words in the course title.


In [34]:
for word in course:
    print(word)

fundamentals
of
data
analysis
with
python


In [35]:
course[0]

'fundamentals'

In [36]:
course[1]

'of'

In [37]:
course[2]

'data'

In [38]:
course[3]

'analysis'

In [39]:
course[4]

'with'

In [40]:
course[5]

'python'

In [41]:
course[-1]

'python'

In [42]:
course[-2]

'with'

In [43]:
for i in range(len(course)):
    print(course[i])

fundamentals
of
data
analysis
with
python


In [44]:
course.append('John McLevey')
course.append('Jillian Anderson')
course.append('GESIS')

In [45]:
course

['fundamentals',
 'of',
 'data',
 'analysis',
 'with',
 'python',
 'John McLevey',
 'Jillian Anderson',
 'GESIS']

In [46]:
course[6:]

['John McLevey', 'Jillian Anderson', 'GESIS']

In [47]:
course[-2]

'Jillian Anderson'

In [48]:
course.index('GESIS')

8

## <font color="crimson"><i class="fa fa-user"></i> YOUR TURN!</font> 

In the cell below, change the value of `GESIS` to `Leibniz Institute for Social Sciences` in the list `course` using list indices. 

In [None]:
# Your Answer Here




Let's see if it worked. 

In [49]:
if 'Leibniz Institute for Social Sciences' in course:
    print('Not yet. Try again!')
else:
    print('Changed! Well done.')

Changed! Well done.


In [50]:
countries = ['Canada', 'Germany']
countries

['Canada', 'Germany']

It is also possible to store lists inside of lists.  

In [51]:
who = [countries, ['University of Waterloo', 'Leibniz Institute']]
who

[['Canada', 'Germany'], ['University of Waterloo', 'Leibniz Institute']]

In [52]:
for each in who:
    print(each)

['Canada', 'Germany']
['University of Waterloo', 'Leibniz Institute']


In [53]:
for each in who:
    for value in each:
        print(value)
    print('\n')

Canada
Germany


University of Waterloo
Leibniz Institute




## Tuples

`tuples` are very similar to `lists` except that they are 'immutable' (can't be changed) and values are stored in between `()` rather than `[]`. 

In other words, we can modify the values of a `list` because `lists` are mutable (can be changed). We can't modify the values of a `tuple`, because `tuples` are immutable. Using tuples can speed up your code considerably.

In [54]:
course_list = course.copy()
course_list

['fundamentals',
 'of',
 'data',
 'analysis',
 'with',
 'python',
 'John McLevey',
 'Jillian Anderson',
 'GESIS']

In [55]:
course_tuple = tuple(course)
course_tuple

('fundamentals',
 'of',
 'data',
 'analysis',
 'with',
 'python',
 'John McLevey',
 'Jillian Anderson',
 'GESIS')

In [56]:
course_list.sort()
course_list

['GESIS',
 'Jillian Anderson',
 'John McLevey',
 'analysis',
 'data',
 'fundamentals',
 'of',
 'python',
 'with']

In [57]:
course_list.sort(reverse=True)
course_list

['with',
 'python',
 'of',
 'fundamentals',
 'data',
 'analysis',
 'John McLevey',
 'Jillian Anderson',
 'GESIS']

In [58]:
course_tuple.sort()
course_tuple

AttributeError: 'tuple' object has no attribute 'sort'

# Dictionaries

`Dictionaries` are a flexible way to organize data. `Dictionaries` contain `key`-`value` pairs. For example, the following `dictionary` contains information about Simon Fraser University, where Jillian currently works as a data engineer. The `keys` are then name of the university, the province, the country, the cities with of main and sattalite campuses, and the number of enrolled students in 2015. 

In [59]:
uni1 = {
    'name': 'Simon Fraser University',
    'province': 'British Columbia',
    'country': 'Canada',
    'cities': ['Burnaby', 'Surrey', 'Vancouver'],
    '2015enrollments': 34990
}

In [60]:
uni1

{'name': 'Simon Fraser University',
 'province': 'British Columbia',
 'country': 'Canada',
 'cities': ['Burnaby', 'Surrey', 'Vancouver'],
 '2015enrollments': 34990}

In [61]:
uni1['cities']

['Burnaby', 'Surrey', 'Vancouver']

In [62]:
uni1['2015enrollments']

34990

In [63]:
for city in uni1['cities']:
    print(city)

Burnaby
Surrey
Vancouver


In [64]:
for v in uni1:
    print(v)

name
province
country
cities
2015enrollments


In [65]:
for k,v in uni1.items():
    print(v)

Simon Fraser University
British Columbia
Canada
['Burnaby', 'Surrey', 'Vancouver']
34990


In [66]:
for each in uni1.values():
    print(each)

Simon Fraser University
British Columbia
Canada
['Burnaby', 'Surrey', 'Vancouver']
34990


As you can see, we can use `lists` inside of `dictionaries` and `lists` inside other `lists`, we can have `dictionaries` inside `lists` and even `dictionaries` inside of other `dictionaries`. Combining data structures this way enables us to model the real world in powerful and flexible ways. 

Here's another dict with information about the University of Waterloo, where John works. The keys are the same as the previous dict. 

In [67]:
uni2 = {
    'name': 'University of Waterloo',
    'province': 'Ontario',
    'country': 'Canada',
    'cities': ['Waterloo', 'Kitchener', 'Stratford', 'Cambridge'],
    '2015enrollments': 36670
}

We can add our two dicts to a list, or another data structure. 

In [70]:
universities = [uni1, uni2]

In [71]:
universities

[{'name': 'Simon Fraser University',
  'province': 'British Columbia',
  'country': 'Canada',
  'cities': ['Burnaby', 'Surrey', 'Vancouver'],
  '2015enrollments': 34990},
 {'name': 'University of Waterloo',
  'province': 'Ontario',
  'country': 'Canada',
  'cities': ['Waterloo', 'Kitchener', 'Stratford', 'Cambridge'],
  '2015enrollments': 36670}]

In [72]:
for uni in universities:
    print(uni['name'], uni['province'], uni['country'])

Simon Fraser University British Columbia Canada
University of Waterloo Ontario Canada


## <font color="crimson"><i class="fa fa-user"></i> YOUR TURN!</font> 

In the cell below, (1) create a dictionary called `me` with the following `keys`: 'first_name', 'last_name', 'discipline' (e.g. political science), 'research_interests',  and 'programming_experience_level'. Enter the correct values. Once you have created this dictionary, (2) print each value to the screen using a for loop. 

In [73]:
# Your Answer Here 





# Open Work Time <a id='open'></a>