# Session 1b: Python and Social Data Science

*Joachim Kahr Rasmussen*



## Agenda

1. [Module Structure](#Module-structure)
2. [Data Science and Beyond](#Data-science-and-beyond)
3. [Python - an Overview](#Python-\--an-Overview)
4. [The Python Language](#The-Python-Language)

# Associated Readings

BBB, chapter 1:
- Social science 'vs' data science
- The Digital Age: Availability of data and computer power
- Intro to ethics

Grimmer (2015):
- Descriptive vs causal inference
- Importance of research design
- New data: Magnitude and type

Inspirational readings:
- "What is code?": Maybe save it for a rainy day.
- "The end of theory: The data deluge makes the scientific method obsolete": Polemic post about data and models.

# Module Structure

## The Structure of Classes

Most teaching modules will have the following structure:

- Before lectures and exercise classes: 
    - Do the **reading** $-$ we have some quality textbooks!
    - Watch **previous recorded lectures** $-$ it takes more than one view to get under your skin

- Lectures: 
    - Introduces **new material**, but we are trying something slightly different this year
    - The **format will vary** for each module: Generally...
        - **Live** recap and introduction
        - **Questions** from last time
        - Technical content is **recorded** for you to...
            - watch at **preferred pace**, and
            - potentially ***while* solving exercises**
    - When watching videos, you can ALWAYS **ask questions**.

- Exercise class:
    - Continue working on **exercises**
    - **Discuss with TA** $-$ use the fact that they are there with the sole purpose of helping you learn

## The Academic Quarter

In case you don't know...
- 9 means 9.15, 
- 13 means 13.15 (i.e. 1.15pm)

This holds for both exercise classes and lectures!

## Learning Outcomes After Completing Intro SDS 

Main elements in this course:
- Tidy / transform: Data structuring and text (sessions 2-3, 15)
- Visualize: Plotting (session 4)
- Import: Scraping and data IO (sessions 5, 6-8)
- Ethics: Rules and moral considerations for working with data  (session 9)
- Model: Fundamentals of machine learning, application to text (session 10-14)

Also a tiny bit on...
- Communication and access: Git and Markdown (should have been covered briefly by Andreas earlier today)

## What We Don't Teach You Now

Many more courses build on this:
- Statistics: Econometrics and Machine Learning - overlaps
- More about data - modelling and processing:
    - Text, networks/relational, spatial
- Non-linear ML models:
    - Tree based and kernel 
    - Neural networks
- ML for dynamic decisions: reinforcement learning
- All about privacy

In particular, check out [Advanced SDS I](https://kurser.ku.dk/course/asdk20004u/2020-2021) and [Advanced SDS II](https://kurser.ku.dk/course/asdk20006u/2020-2021). Also sometimes possible to write a project (BA, MSc, [seminar](https://kurser.ku.dk/course/a%c3%98kk08411u/2020-2021)).

# Data Science and Beyond

## Why Data Science Now?

Three trends are important for understanding the increasing interest in and influence of data science:

- **Data** is increasingly available, e.g.
    - *Social Media*: Facebook status updates, public 'tweeting', Instagram pictures, etc.
    - *Shopping*: Online shopping (track everything, make experiments), in store shopping (memberships, less granular data).
    - *Phones*: Patterns in phone calls, GPS logging, phone activity (see recent SODAS studies), etc.
- Faster and bigger **computers** ([Waldrop, 2016](https://pubmed.ncbi.nlm.nih.gov/26863965/)):
    - Moore’s law $\sim$ the number of transistors on a microprocessor chip doubles roughly every two years 
    - Soon coming to a halt due to heat issues...
- Improved **algorithms**, methods for computation and accessibility of tools
    - Example: Development of effective and accessible libraries in python
    - Much more about this later in the course...



## Examples of Major Advances

Innovations from data science already create enormous amounts of value and make our lives much easier. Some examples:
- Autonomous systems: 
    - self-driving cars
    - computer game bots, 
    - trading bots, etc.
- Image and text recognition: 
    - face recognition (Facebook, police)
    - language parsing with e.g. Google Translate/GPT-3 (Grammarly, auto-correct)
- Combined services: 
    - Virtual assistants (banks, medical diagnosis)
    - recommendation systems (amazon, spotify, netflix)
    - classification systems (spam, tax fraud, plagiarism)

[McKinsey (2018)](https://www.mckinsey.com/~/media/McKinsey/Featured%20Insights/Artificial%20Intelligence/Notes%20from%20the%20frontier%20Modeling%20the%20impact%20of%20AI%20on%20the%20world%20economy/MGI-Notes-from-the-AI-frontier-Modeling-the-impact-of-AI-on-the-world-economy-September-2018.ashx#:~:text=In%20the%20aggregate%2C%20and%20netting,about%201.2%20percent%20a%20year.&text=The%20economic%20impact%20may%20emerge%20gradually%20and%20be%20visible%20only%20over%20time.): 'Artificial Intelligence' has the potential to increase global GDP by 1.2 percent per year

## Past the Peak? Illustrative example
For a couple of years: Data scientists had HIGHEST entry wages in DK.

More recent evidence: Not top, but still high...

*Why did the mean relative data scientist entry wage decline?*

Paradoxically, this is where *you* as social scientists come into the picture!

Key issue: Prediction based agenda is flawed. But opens up new opportunities:

- Combine with theory: Supply and demand $-$ did supply catch up? Selection?
- Combine with causal inference: Instruments, regression discontinuity, matching...

Clearly a role for econometrics and structural modelling!

## Social Data Science (I/II)

Social data scientists combine skills and tools from two different fields:
- Data scientists in a nutshell (BBB):
    - Developing algorithms that are fast and flexible
    - Largely concerned with prediction $-$ not causal inference
- Social scientist in a nutshell (BBB):
    - Study human behavior and interactions
    - Causal inference is important for understanding implications of e.g. policies


## Social Data Science (II/II)

We may say that the skills and ideas of data science are spreading to social data science:
- smart, free tools for working with
    - small and big data on structured (tabular) data
    - unstructured data sources from image, text and social media
- incorporate machine learning into
    - statistics and causal inference
    - economic modelling

In particular, data science complements social sciences by:
- enhancing existing fields, and
- opening up for new fields to emerge (new data, combination of methods)

# Python - an Overview

## Introducing Python

*What is Python useful for?*

* It can do "anything" and is [used everywhere](https://www.python.org/about/success/)
    * High-tech manufacturing
    * Space shuttles 
    * Large servers     
* Python has incredible resources for machine learning, big data, visualizations.

## Use is Evidently Trending

<center><img src='https://grapecitycontentcdn.azureedge.net/blogs//grapecity/20181026-the-growth-of-major-programming-languages/3.jpg' alt="Drawing" style="width: 900px;"/></center>

## What and Why?

*What is Python?*

A multiparadigm, general purpose programming language.
  * Can do everything you can imagine a computer can do.
  * E.g. manage databases, advanced computation, web etc.
 

*Why Python?*

Python's main objective is to make programming more ***effortless***. 
- This is done by making syntax intuitive.
- A side effect: programming can be fun
- Downside: not the fastest (solved with packages)

## Other Programs and Languages (I/II)

*Is Python the most popular for statistics and data science?*

There are other good languages, e.g. R, Stata or SAS, why not use them?

- Python has the best data science packages.
- And it is also being used increasingly in statistics.

Each program/language has its own advantages and similarities (our opinion):

|                     | Python | R | Julia |
|---------------------|--------|---|-------|
| Data structuring    | X      | X | |
| Plotting            | X      | X | |
| Machine learning    | X      |   | |
| Statistics          |        | X | |
| General programming and modelling | X      |   | (X)|
| Ease of learning    | X      |   | |

Other ('statistical') programming languages: SAS, Stata, ...


## Other Programs and Languages (II/II)

Tools are increasingly integrated
- Jupyter; a shared framework for data science 
- New software allows direct execution side-by-side: use R within Python (and vice versa)
- New tools becoming available across languages, e.g. data processing engine (arrow)

Advice: don't worry. It's likely you need to learn more than one language.

## The Wheel of Data Science

*How does data science work?*

<br>
<br>
<center><img src='https://raw.githubusercontent.com/hadley/r4ds/master/diagrams/data-science.png' alt="Drawing" style="width: 700px;"/></center>

## Learning How to Code... Not a Free Lunch

This course... it ain't easy!

<center><img src='https://media.giphy.com/media/h36vh423PiV9K/giphy.gif' alt="Drawing" style="width: 300px;"/></center>

But at least you get a lot of help!

Learning without supervision, you may struggle with simple stuff...

## Some Encouragement

#### Hadley Wickham

> The bad news is that when ever you learn a new skill **you’re going to suck**. It’s going to be **frustrating**. The good news is that is typical and **happens to everyone** and it is **only temporary**. You can’t go from knowing nothing to becoming an expert without going through a **period of great frustration** and great suckiness.


#### Kosuke Imai

> One can learn data analysis **only by doing**, not by reading.

## Light at the End..

Why would you go through this pain? You choose one of two paths after this course...

i. You move on, you forget some or most of the material.

ii. You are lit and your life has changed. 
- You may return to become a better sociologist, anthropologist, economist etc.
- Or, you may continue along the new track of data science.
- In any case, you keep learning and expanding your programming skills.

## Advice for Coding

Three pieces of advice that will take you far!

1. Be careful: Think before you code $-$ what are you trying to make it do?

2. Be lazy: Reuse code and write reusable code (e.g. functions)

3. Make understandable: Think about audience!
    1. Future you? May not recall this at all.    
    1. Group members or world? May not understand!
    1. Write lots of comments and potentially background explanation/documentation
    


*How do you get there?*

- Maintain healthy curiosity  $-$  how could we do things better?
- Practice and try as much as possible
- Type the code in yourself  $-$ then you see what is going on.


## Help and Advice


Whenever you have a question you do as follows:


1: You ask other people in your group.


2: You search on Google (more advice will follow).

3: You ask the neighboring groups.

4: You raise an [issue in our Github repo](https://github.com/abjer/sds/issues) or you ask us.

# The Python Language


## The Python Shell

The fundamental way of accessing Python is from your shell by typing *`python`*. On a Windows computer, this would simply be the command prompt.

Everyone should be able to run the following commands and reproduce the output.

``` python
>>> print ('hello my friend')
hello my friend
```

``` python
>>> 4*5
20
```

You can leave Python again by simply typing *`quit()`*.

If you want to close the prompt, simply type *`exit`*.

## The Python Script (I/III)

The power of the interpreter is that it can be used to execute Python scripts. 

*What is a script?* 

These are programs containing code blocks.

## The Python Script (II/III)

Everyone should be able to make a text file called *`test.py`* on the desktop or in some folder with some content and run it.

The file should contain the following two lines:

``` python
print ('Line 1')
print ('Line 2')
```

I saved mine on the desktop.

Reopen the prompt and type something equivalent to...

```
cd "C:\Users\xtw562\Desktop"
python test.py
```

This should now yield the following output:

```
Line 1
Line 2
```

## The Python Script (III/III)

Now, everyone should be able to make a text file called *`test.py`* in their current folder with some content and run it. Current folder?

``` python
>>> import os
>>> print("Current working directory: {0}".format(os.getcwd()))
Current working directory: C:\Users\xtw562
```

Now specify your working directory...

``` python
>>> os.chdir('C:/Users/xtw562/Desktop')
>>> print("Current working directory: {0}".format(os.getcwd()))
Current working directory: C:\Users\xtw562\Desktop
```

Now that we have chosen the working directory, create *`test.py`* there with the same two `print` lines as before.

Try executing the test file from the shell by typing:

*`python test.py`*

This should yield the following output:

```
Line 1
Line 2
```

## The Jupyter framework (I/III)

*What is Jupyter Notebook?*

- Jupyter provides an interactive and visual platform for working with data. 
- The name is a reference to the languages Julia, Python, and R.

*Why Jupyter notebook?*

- great for writing. 
    - markdown, equations and direct visual output;
- interactive: allows keeping and changing data etc.
- many tools (e.g. create this slideshow)

*How do we create a Jupyter Notebook?*

We start Jupyter Notebook by typing *`jupyter notebook`* in the shell.

Try making a new notebook: 
- click the button *`New`* in the upper right corner 
- click on *`Python 3`*.

## The Jupyter framework (II/III)


*How do we interact with Jupyter?*

Jupyter works by having cells in which you can put code. The active cell has a colored bar next to it.

A cell is in *`edit mode`* when there is a <span style="color:green">*green*</span> bar to the left. To activate *`edit mode`* click on a cell.

A cell is in *`command mode`* when the bar to the left is <span style="color:blue">*blue*</span>.

*How do we add and execute code?*

Go into edit mode - add the following:

```python
A = 11
B = 25

A*B
```

```
275
```

Click the &#9658; to run the code in the cell. What happens if we change A*B to A+B?

## The Jupyter framework (III/III)


*How can we add cells to our notebook?*

Try creating a new cell by clicking the **`+`** symbol.

*Some relevant keyboard short cuts?*

Editing and executing cells
- enter edit mode: click inside the cell or press `ENTER`
- exit edit mode: click outside the cell or press `ESC`
- execute code within a cell: `SHIFT`+`ENTER` (run and move to the next cell) or `CTRL`+`ENTER` (run and stay in the cell)


Adding cell (`a` above, `b` below) and removing cells (press `d` twice)  

More info:
- For tips [see blog post](https://abjer.github.io/sds2019/post/jupyter) or see the list of Jupyter keyboard shortcuts in the menu (top): `Help > Keyboard Shortcuts`.
- General resources in documentation and tutorial available [here](http://jupyter.readthedocs.io/en/latest/).

## Before Looking Into More Advanced Concepts...

... we begin with a quiz!

### Go to [kahoot.it](https://kahoot.it)

## Fundamental data types (I/II)

Recall the four fundamental data types: `int`, `float`, `str` and `bool`.
- Sometimes known as elementary, primitive or basic.

Some data types we can change between, e.g. between `float` and `int`.

```python
int(1.6) # integer conversion truncates toward zero (same as floor for positive numbers)
```

```
1
```

```python
float(int(1.6)) # it does not retake its former value
```

```
1.0
```

We can do the same for converting `float` and `int` to `str`. Note that some conversions are not allowed.
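A minimal sketch of these conversions, including one that is not allowed. The variable names are illustrative and not part of the lecture material.

```python
# Converting between the fundamental types
as_int = int(1.6)         # float -> int drops the decimals
as_float = float(3)       # int -> float
as_str = str(2.5)         # float -> str
from_str = float("2.5")   # str -> float works when the text looks like a number

# Some conversions are not allowed and raise an error
try:
    int("2.5")            # a string with decimals cannot go directly to int
except ValueError as err:
    print("Not allowed:", err)

print(as_int, as_float, as_str, from_str)
```

A two-step conversion such as `int(float("2.5"))` does work, since each step is allowed on its own.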

## Fundamental data types (II/II)

*What is an object in Python?*

- A thing, anything - everything is an object.

*Why use objects?*

- Easily manipulable, powerful methods and flexible attributes. 
- We can make complex objects, e.g. estimation methods, quite easily.
- Example of a float method:

```python
(1.5).as_integer_ratio()
```

```
(3, 2)
```
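To make the distinction between methods and attributes concrete, here is a small sketch using only built-in types. The variable names are illustrative.

```python
# Everything is an object: each value has a type, attributes and methods
text = "social data science"
number = 1.5

print(type(text))            # the class the object belongs to
print(text.upper())          # a string method (called with parentheses)
print(text.split())          # another string method, returns a list
print(number.is_integer())   # a float method
print(number.real)           # a float attribute (no parentheses)
```

Typing `dir(text)` in a notebook lists all methods and attributes available on the object.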

## Debugging (I/III)

*Code fails all the time!*


```python
A='I am a string'
int(A)
print(A)
```

```
ValueError: invalid literal for int() with base 10: 'I am a string'
```

## Debugging (II/III)

*How do you fix code errors?*


Look at the error message:
1. **Where** is the error? I.e. what line number (and which function).
  - Inspect the elements from the failure before the error occurs. 
     - Note: if you use a function you may want to try printing elements
  - Try replacing the objects in the line.
  
2. **What** goes wrong? Examples:
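  - `SyntaxError`: code that Python cannot parse (e.g. a missing parenthesis or colon); `ValueError`: right data type but an inappropriate value; `TypeError`: data type mismatch.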
  - Hint: reread it several times, search on Google if you do not understand the error.
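The "where" and "what" questions can also be answered in code. The sketch below is one possible approach, not the only one: it catches the failing conversion and prints the error type, the line number and the offending object. It uses the standard `traceback` module.

```python
import traceback

A = 'I am a string'

try:
    int(A)
except ValueError as err:
    # "What" goes wrong: the error type and its message
    error_name = type(err).__name__
    print(error_name, '-', err)
    # "Where" is the error: the last traceback entry holds the line number
    last_frame = traceback.extract_tb(err.__traceback__)[-1]
    print('Failed on line', last_frame.lineno)
    # Inspect the object that caused the failure
    print('Offending value:', repr(A), 'of type', type(A).__name__)
```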


## Debugging (III/III)

*Exercise: investigate the error we encountered*


* Look at the answers in this stackoverflow post: [https://stackoverflow.com/questions/8420143](https://stackoverflow.com/questions/8420143).
* An explanation by Blender:
> Somewhere in your text file, a line has the word `id` in it, which can't really be converted to a number.


## Operators

*What computations can python do?*

- Numeric operators: Output a numeric value from numeric input. 
    - `+`; `*`; `-`; `/`.
- Comparison operators: Output a boolean value, `True` or `False`
    - `==`; `!=` (equal, not equal - input from most object types)
    - `>`; `<`. (greater, smaller - input from numeric)
- Logical operators: Output a boolean value from boolean input.
    - `and`; `or`; `not` (on booleans, `&` and `|` behave like `and` and `or`; note that `!` alone is not a Python operator)



*How can we test an expression in Python?*

We can check the validity of a statement using comparison operations:

```python
3 == (2 + 1) # other ops: >, !=, >=
```

```
True
```

And apply logical operations:

```python
True & False # When one part is false, the whole statement is false.
```

```
False
```
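A short sketch of the three operator groups in action. The operators `//`, `%` and `**` are additions that are not listed on the slide, and the values of `a` and `b` are arbitrary.

```python
a, b = 7, 2

# Numeric operators
print(a + b, a - b, a * b, a / b)   # note that / always returns a float
print(a // b, a % b, a ** b)        # integer division, remainder, power

# Comparison operators return booleans
print(a > b, a == b, a != b)

# Logical operators combine booleans
print(a > 5 and b > 5)   # both parts must be true
print(a > 5 or b > 5)    # one true part is enough
print(not a > 5)         # negation of the comparison
```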

## Control flow (I/II)

*How can we activate code based on data?*


Conditional execution of code: if a condition is true, then the code is activated.

In Python this is easy with the `if` syntax:

```
if condition:  
    (CODE BLOCK LINE 1)
    (CODE BLOCK LINE 2)
    ...
```

The condition is either a variable or an expression. If it evaluates to `True`, the code block is executed.
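A small sketch of both kinds of condition. The names `is_weekend`, `n_students` and `messages` are hypothetical and only serve the illustration.

```python
is_weekend = True    # hypothetical variable holding a boolean
n_students = 0       # hypothetical counter
messages = []

if is_weekend:                            # a variable used directly as the condition
    messages.append("No lecture today")

if n_students > 0:                        # an expression used as the condition
    messages.append("We have students")   # skipped when the expression is False

print(messages)
```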

## Control flow (II/II)

We can use comparison and logical operators directly as they output boolean values.

```python
if 4 == 4:
    print("I'm being executed, yay!")
else:
    print("Oh no, I'm not being executed!")
```

```
I'm being executed, yay!
```


We can make deep control flow structures:

```python
A = 11
if A >= 0:
    if A == 0:
        print("I'm exactly zero!")

    elif A < 10:
        print("I'm small but positive!")

    else:
        print("I'm large and positive!")
else:
    print("Oh shoot, I'm negative!")
```

```
I'm large and positive!
```


## Containers

*How do we store multiple objects?*

- We put objects into containers. (Like a bag)
- An example is a `list` where we can add and remove objects

*What are they useful for?*

- We can use them to compute statistics (max, min, mean)
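As a minimal sketch, simple statistics can be computed directly on a list with built-in functions. The data in `grades` is made up for illustration.

```python
grades = [7, 10, 4, 12, 7]   # hypothetical data

highest = max(grades)                 # largest element
lowest = min(grades)                  # smallest element
mean = sum(grades) / len(grades)      # mean, computed by hand

print(highest, lowest, mean)
```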

## Sequential containers

*Which data types are ordered?*

- Sequential containers are ordered from first to last. 
- Their elements can be accessed using integer position/order.
    - Done with square bracket syntax `[]` 
    - Note **first element is 0, and last is n-1!**
    - One exception is iterators (`iter`), which are incredibly fast but cannot be accessed by position.

*Which containers are sequential?*

- `list` which we can modify (**mutable**).
    - useful to collect data on the go
- `tuple` which cannot be modified after initial assignment (**immutable**)
     - tuples are faster as they can do fewer things
- `array` 
    - which is mutable in content (i.e. we can change elements)
    - but immutable in size
    - great for data analysis
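The sketch below illustrates positional access and the difference between mutable and immutable containers. It assumes that `array` refers to a numpy array and that numpy is installed. The variable names are illustrative.

```python
import numpy as np

letters = ['a', 'b', 'c']         # list: mutable
print(letters[0], letters[2])     # first element is 0, last is n-1
print(letters[-1])                # negative positions count from the end

letters[0] = 'z'                  # lists can be changed in place
letters.append('d')               # ... and can grow

point = (1, 2)                    # tuple: immutable
try:
    point[0] = 5
except TypeError as err:
    print("Tuples cannot be changed:", err)

values = np.array([1, 2, 3])      # numpy array: content can change
values[0] = 10
print(values)
```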


## Lists

A list can be modified (mutated) by methods, e.g.
- We can `append` objects to it and `remove` them again.
- We can use operations like `+` and `*`.


```python
list_1 = ['A', 'B']
list_2 = ['C', 'D']
list_1 + list_2
```

```
['A', 'B', 'C', 'D']
```
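A short sketch of the list methods and operations mentioned above, using a made-up list called `basket`.

```python
basket = ['apple', 'pear']
basket.append('plum')        # add an element at the end
basket.remove('pear')        # remove the first element equal to 'pear'
print(basket)

longer = basket + ['kiwi']   # + concatenates into a new list
doubled = basket * 2         # * repeats the list
print(longer)
print(doubled)
```

Note that `append` and `remove` change the list in place, while `+` and `*` return new lists and leave `basket` untouched.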

## Non-sequential types

*Are there any non-sequential containers?*
- A dictionary (`dict`), whose elements are accessed by keys (immutable objects).
    - Focus of tomorrow.
- A `set` where elements are
    - unique (no duplicates) 
    - not ordered
    - disadvantage: cannot access specific elements!
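A minimal sketch of both container types. The names and values are hypothetical.

```python
# dict: look up values by key instead of by position
ages = {'Anna': 23, 'Bo': 31}    # hypothetical data
print(ages['Anna'])              # access by key
ages['Carl'] = 27                # add a new key-value pair

# set: unique elements without order
fields = {'economics', 'sociology', 'economics'}
print(len(fields))               # duplicates are dropped
print('sociology' in fields)     # membership tests work
# fields[0] would raise a TypeError: sets have no positions
```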


## For loops

*Why are containers so powerful?*



We can iterate over elements in a container -  this creates a *finite* loop, called the `for` loop. 

Example - try the following code:

```python
A = []
for i in range(4):
    i_squared = i**2
    A.append(i_squared)

for a in A:
    print(a)
```

```
0
1
4
9
```


For loops are smart when: iterating over files in a directory; iterating over specific set of columns.

How does Python know where the code inside the loop begins and ends?
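Python uses indentation to decide what belongs to the loop. In this small sketch, the indented lines run on every pass, while the last line runs once.

```python
total = 0
for i in range(3):
    total += i               # indented: runs on every pass through the loop
    print("inside:", i)      # indented: also part of the loop
print("after:", total)       # not indented: runs once, after the loop ends
```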

## The one line loop

*What is the fastest way to write a loop?*

Using list comprehension (comprehensions also work for other containers):

```python
A = [i**2 for i in range(4)]
print(A)
```

```
[0, 1, 4, 9]
```
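As a sketch of how the same pattern carries over to other containers, the example below builds a dictionary and a set, and adds a filter with `if`. The variable names are illustrative.

```python
squares_list = [i**2 for i in range(4)]                  # list comprehension
squares_dict = {i: i**2 for i in range(4)}               # dict comprehension
remainders = {i % 2 for i in range(4)}                   # set comprehension
even_squares = [i**2 for i in range(4) if i % 2 == 0]    # with a filter

print(squares_list, squares_dict, remainders, even_squares)
```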



## While loops

*Can we make a loop without specifying the end?*

Yes, this is called a `while` loop. Example - try the following code:


```python
i = 0
L = []
while i < 3:
    L.append(i*2)
    i += 1
print(L)
```

```
[0, 2, 4]
```


Applications
- Can be applied in scraping, models that run until they converge, etc.
- Making a server process that keeps running
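To illustrate the convergence use case, here is a sketch that approximates the square root of 2 by repeated averaging (the Babylonian method, chosen here only as an example). The loop stops when the guess is good enough. The extra cap on `n_iter` is a safety net against infinite loops.

```python
target = 2.0
guess = 1.0
tolerance = 1e-10
n_iter = 0

# Keep improving the guess until it is close enough (or we give up)
while abs(guess**2 - target) > tolerance and n_iter < 100:
    guess = (guess + target / guess) / 2
    n_iter += 1

print(guess, n_iter)
```

We do not know in advance how many iterations are needed, which is what distinguishes this from a `for` loop.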

## Making Code Reusable


### Functions
Be careful of how you define input/outputs and how objects are created. In particular:

- Globals: Objects defined outside function
    - Note: These are available both locally and globally (although, they can be overwritten)
- Locals: Objects that are created within a sub-level. 
    - Example: objects defined inside a function
    - Cannot use outside, unless we use `return`
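A minimal sketch of the difference between globals and locals. The function `after_tax` and the variable names are hypothetical.

```python
rate = 0.25                   # global: defined outside the function

def after_tax(income):
    tax = income * rate       # 'tax' is local; 'rate' is read from the global scope
    return income - tax       # only the returned value leaves the function

net = after_tax(100)
print(net)

try:
    print(tax)                # 'tax' does not exist outside the function
except NameError as err:
    print("Local only:", err)
```

Passing `rate` as a second argument instead of relying on the global would make the function easier to reuse.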

### Checklist for Functions
Some key questions that you should often ask yourself:
- Should I write a function?
    - If you are repeating some process: Yes!
    - Makes code easier to write (better overview, fewer mistakes) and read
- If yes:
    - What do you intend the function to output? Is the correct output returned?
    - Do you use locals where possible? 
    - Are globals assigned before you define the function in notebook/script?
        - (if not, your function may fail! Reason: some used globals are not defined)


### Classes 
Where do objects come from?

- From a collection of pre-defined attributes and methods $-$ this is called a `class`.
- We can make our own. It is complex, but powerful.
- Examples of classes/objects we will see: 
    - Dataframes, scraping tools, machine learning models etc.
    - (everything is an object)
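As a sketch of what making our own class can look like, the hypothetical `Student` class below bundles two attributes and one method.

```python
class Student:
    """A hypothetical class, for illustration only."""

    def __init__(self, name, grades):
        self.name = name        # attribute
        self.grades = grades    # attribute

    def mean_grade(self):       # method
        return sum(self.grades) / len(self.grades)

anna = Student('Anna', [7, 10, 12])   # create an object from the class
print(anna.name)
print(anna.mean_grade())
```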

## Copy vs. View

Important: when writing `A = B`, `A` is only a reference to the same object as `B`!

- In other words: `A` is a **view** of `B`
- Implication 
    - if `A` is mutable, e.g. list, dataframe: changes to `B` show up in `A` and vice versa 

We can break this dependency by explicitly making a **copy**. 
- For instance, in pandas use `A = B.copy()` method
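The same behavior can be seen with plain lists. This is a sketch using the built-in `list.copy()` method rather than pandas.

```python
B = [1, 2, 3]
A = B               # A and B are two names for the same list
A.append(4)
print(B)            # the change made through A shows up in B

C = B.copy()        # an explicit copy breaks the dependency
C.append(5)
print(B)            # B is unaffected by changes to C
print(A is B, C is B)
```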



## Coding That is Fast

### One-liners using comprehensions

A very compact way of writing code: *comprehension*. Example of list comprehension:
```python
new_list = [my_proc(a) for a in my_list] 
```

The new thing is that we define the loop inside the list! 


```python
# Define function
def my_proc(a):
    return a**2

# Generate a list
my_list = []
for i in range(10):
    my_list.append(i+1)

# Use line from before to generate transformation 
new_list = [my_proc(a) for a in my_list]

# Print lists
print('my_list:', my_list)
print('new_list:', new_list)
```

```
my_list: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
new_list: [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
```


### The Need for Speed
Python is elegant and simple

However, Python is NOT built for speed. 

We can compensate by using smart packages that have fast algorithms, e.g. numpy and pandas


### Vectorization

An alternative is to write our code in terms of numpy arrays.

In the example below, we generate a long list by making some transformation of an input... we get around 30 times speed-up by using numpy!

```python
%%timeit 
[i+3 for i in range(10**7)]
```

```
821 ms ± 21.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

```python
import numpy as np
```

```python
%%timeit
np.arange(10**7)+3
```

```
24.8 ms ± 1.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
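A small check, on a shorter input, that the loop and the vectorized version give the same numbers. This sketch assumes numpy is installed. The names `with_loop` and `with_numpy` are illustrative.

```python
import numpy as np

n = 1000
with_loop = [i + 3 for i in range(n)]   # pure Python, element by element
with_numpy = np.arange(n) + 3           # one vectorized operation on the whole array

same = np.array_equal(with_loop, with_numpy)
print(same)
```

The vectorized version expresses the operation once for the whole array, which lets numpy run the loop in compiled code.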


### Take Away on Speed

Use pandas or numpy - they are optimized for speed! 

Why are pandas and numpy fast? 

- Because they are written in fast, low-level code with optimized algorithms! 

Why are we not learning low level? 
- Requires too much code for simple operations - not efficient to write!
- Too steep learning curve (not like Python!!)

# Outro

- Coding is tough, but worth learning
    - Also for social scientists
- You can dig deeper into
   - Advanced python
   - Git for saving and sharing code
   
- Next lecture this afternoon: Strings, queries and APIs