# Python and Social Data Science

# This talk

- Why social data science 
- Course overview
- Learning to code
- Python 
   - Why we chose it
   - Advanced concepts
- Power tools: git and markdown

## Data science - an overview

Three trends are important for understanding the increasing interest in and influence of data science:

- **Data** is increasingly available, e.g.
    - *Social Media*: Facebook status updates, public 'tweeting', Instagram pictures, etc.
    - *Shopping*: Online shopping (track everything, make experiments), in store shopping (memberships, less granular data).
    - *Phones*: Patterns in phone calls, GPS logging, phone activity (see recent SODAS studies), etc.
- Faster and bigger **computers** ([Waldrop, 2016](https://pubmed.ncbi.nlm.nih.gov/26863965/)):
    - Moore’s law $\sim$ the number of transistors on a microprocessor chip doubles every two years 
    - Soon coming to a halt due to heat issues...
- Improved **algorithms**, methods for computation and accessibility of tools
    - Example: Development of effective and accessible libraries in Python
    - Much more about this later in the course...



## Examples of Major Advances

Innovations from data science already create enormous amounts of value and make our lives much easier. Some examples:
- Autonomous systems: 
    - self-driving cars
    - computer game bots
    - trading bots, etc.
- Image and text recognition: 
    - face recognition (Facebook, police)
    - language parsing with e.g. Google Translate/GPT-3 (Grammarly, auto-correct)
- Combined services: 
    - Virtual assistants (banks, medical diagnosis)
    - recommendation systems (Amazon, Spotify, Netflix)
    - classification systems (spam, tax fraud, plagiarism)

[McKinsey (2018)](https://www.mckinsey.com/~/media/McKinsey/Featured%20Insights/Artificial%20Intelligence/Notes%20from%20the%20frontier%20Modeling%20the%20impact%20of%20AI%20on%20the%20world%20economy/MGI-Notes-from-the-AI-frontier-Modeling-the-impact-of-AI-on-the-world-economy-September-2018.ashx#:~:text=In%20the%20aggregate%2C%20and%20netting,about%201.2%20percent%20a%20year.&text=The%20economic%20impact%20may%20emerge%20gradually%20and%20be%20visible%20only%20over%20time.): 'Artificial Intelligence' has the potential to increase global GDP by 1.2 percent per year

## Past the Peak? Illustrative example (I/II)
For a couple of years: Data scientists had the HIGHEST entry wages in DK.

More recent evidence: Not top, but still high...

Question: *Why did the mean relative data scientist entry wage decline?*

## Past the Peak? Illustrative example (II/II)

Paradoxically, this is where *you* as social scientists come into the picture!

Key issue: The prediction-based agenda is flawed, but this opens up new opportunities:

- Combine with theory: Supply and demand $-$ did supply catch up? Selection?
- Combine with causal inference: Instruments, regression discontinuity, matching...

Clearly a role for econometrics and structural modelling!

## Social Data Science

Social data scientists combine skills and tools from two different fields:
- Data scientists in a nutshell (BBB):
    - Developing algorithms that are fast and flexible
    - Largely concerned with prediction $-$ not causal inference
- Social scientists in a nutshell (BBB):
    - Study human behavior and interactions
    - Causal inference is important for understanding implications of e.g. policies

The skills and ideas of data science are spreading to social data science:
- smart, free tools for working with
    - small and big structured (tabular) data
    - unstructured data sources from image, text and social media
- incorporate machine learning into
    - statistics and causal inference
    - economic modelling
   

Data science complements social sciences by:
- enhancing existing fields, and
- opening up for new fields to emerge (new data, combination of methods)



# Course Overview

## Module Structure

Most teaching modules will have the following structure:

1. Before exercise class: 
    - Watch previously recorded lectures and do the reading
    - If you have time, you may attempt to solve exercises
2. Exercise class:
    - Continue working on exercises
    - Discuss with the TA $-$ use the fact that they are there with the sole purpose of helping you learn
3. Lectures: 
    - Introduce new material, but in a slightly different format due to Covid-19
    - The format will vary for each module: Generally...
        - Live recap and introduction
        - Technical content is recorded for you to watch at preferred pace
    - While watching videos, you can ALWAYS ask questions.

## The Wheel of Data Science
<br>
<br>
<center><img src='https://raw.githubusercontent.com/hadley/r4ds/master/diagrams/data-science.png' alt="Drawing" style="width: 700px;"/></center>


## Learning Outcomes After Completing Intro SDS 

- Tidy / transform: Data structuring and text (sessions 1-5,15)
- Import: Scraping and data IO (sessions 2, 6-8)
- Visualize: Plotting (session 3)
- Model: Fundamentals of machine learning, application to text (sessions 11-15)
- Communicate: Git and markdown (covered briefly at end of talk)

## What We Don't Teach You Now

Many more courses build on this:
- Statistics: Econometrics and Machine Learning - overlaps
- More about data - modelling and processing:
    - Text, networks/relational, spatial
- Non-linear ML models:
    - Tree based and kernel 
    - Neural networks
- ML for dynamic decisions: reinforcement learning
- All about privacy

In particular, check out [Advanced SDS I](https://kurser.ku.dk/course/asdk20004u/2020-2021) and [Advanced SDS II](https://kurser.ku.dk/course/asdk20006u/2020-2021). Also sometimes possible to write a project (BA, MSc, [seminar](https://kurser.ku.dk/course/a%c3%98kk08411u/2020-2021)).

# Learning How to Code

## Not a Free Lunch

This course.. ain't easy..

But at least you get a lot of help!

Learning without supervision, you may struggle with simple stuff...

- - -

<center><img src='https://media.giphy.com/media/h36vh423PiV9K/giphy.gif' alt="Drawing" style="width: 400px;"/></center>


## Some Encouragement

#### Hadley Wickham

> The bad news is that when ever you learn a new skill **you’re going to suck**. It’s going to be **frustrating**. The good news is that is typical and **happens to everyone** and it is **only temporary**. You can’t go from knowing nothing to becoming an expert without going through a **period of great frustration** and great suckiness.


#### Kosuke Imai

> One can learn data analysis **only by doing**, not by reading.

## Light at the End..

Why would you go through this pain? You choose one of two paths after this course...

i. You move on, you forget some or most of the material.

ii. You are lit and your life has changed. 
- You may return to become a better sociologist, anthropologist, economist etc.
- Or, you may continue along the new track of data science.
- In any case, you keep learning and expanding your programming skills.

## Advice for Coding

Three pieces of advice that will take you far!

1. Be careful: Think before you code $-$ what are you trying to make it do?

2. Be lazy: Reuse code and write reusable code (e.g. functions)

3. Make understandable: Think about audience!
    1. Future you? May not recall this at all.    
    1. Group members or world? May not understand!
    1. Write lots of comments and potentially background explanation/documentation
    


## Advice for Learning to Code

- Maintain a healthy curiosity  $-$  how could we do things better?
- Practice and try as much as possible
- Type the code in yourself  $-$ then you see what is going on.


# Python vs. R (vs. Julia)

## Tradeoff?

- Each language has its own advantages, but there are also similarities (my opinion)

|                     | Python | R | Julia |
|---------------------|--------|---|---|
| Data structuring    | X      | X | |
| Plotting            | X      | X | |
| Machine learning    | X      |   | |
| Statistics          |        | X | |
| General programming and modelling | X      |   | (X)|
| Ease of learning    | X      |   | |


- Tools are increasingly integrated
    - Jupyter a shared framework for data science 
    - New software allows direct execution side-by-side: use R within Python (and vice versa)
    - New tools becoming available across languages, e.g. data processing engine (arrow)
    


Advice: don't worry. It's likely you need to learn more than one language.

# Python $-$ Advanced Concepts and Useful Tips


## Making Code Reusable


### Functions
Be careful of how you define input/outputs and how objects are created. In particular:

- Globals: Objects defined outside function
    - Note: These are available both locally and globally (although a local with the same name shadows them inside the function)
- Locals: Objects that are created within a sub-level. 
    - Example: objects defined inside a function
    - Cannot use outside, unless we use `return`
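
A minimal sketch of the difference between globals and locals. All names (`tax_rate`, `net_income`, `shadowing_example`) are hypothetical and chosen only for illustration:

```python
# Hypothetical names, chosen only for illustration.
tax_rate = 0.25  # global: defined outside any function


def net_income(gross):
    # 'deduction' is a local: it exists only while the function runs
    deduction = gross * tax_rate  # reading the global needs no extra syntax
    return gross - deduction      # 'return' hands the result back to the caller


def shadowing_example():
    # Assigning inside a function creates a new local with the same name;
    # the global 'tax_rate' is left unchanged
    tax_rate = 0.5
    return tax_rate


result = net_income(100)
shadowed = shadowing_example()

print(result)    # 75.0
print(shadowed)  # 0.5
print(tax_rate)  # 0.25

try:
    print(deduction)  # locals are not visible outside the function
except NameError as err:
    print('NameError:', err)
```

The only way the value computed inside `net_income` reaches the rest of the script is through `return`.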

### Checklist for Functions
Some key questions that you should often ask yourself:
- Should I write a function?
    - If you are repeating some process: Yes!
    - Makes code easier to write (better overview, fewer mistakes) and read
- If yes:
    - What do you intend the function to output? Is the correct output returned?
    - Do you use locals where possible? 
    - Are globals assigned before you call the function in the notebook/script?
        - (if not, your function may fail! Reason: some of the globals it uses are not defined)
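
A small sketch of two of the checklist points: wrapping a repeated step in a function that uses only locals, and what can happen when a function relies on a global that has not been assigned when it is called. The functions `standardize` and `scale_by_factor` and the data are made up for this example:

```python
# Hypothetical example: the functions and data are invented for illustration.
def standardize(values):
    """Return values rescaled to mean 0 and standard deviation 1."""
    n = len(values)                                   # local
    mean = sum(values) / n                            # local
    var = sum((v - mean) ** 2 for v in values) / n    # local
    std = var ** 0.5
    return [(v - mean) / std for v in values]


heights = [150, 160, 170]
z_heights = standardize(heights)  # reuse the same function for any other list


def scale_by_factor(values):
    # Relies on the global 'factor', which must exist when the function is called
    return [v * factor for v in values]


try:
    scale_by_factor(heights)  # 'factor' has not been assigned yet
except NameError as err:
    print('Failed:', err)

factor = 2
scaled = scale_by_factor(heights)  # works now that 'factor' exists
print(scaled)  # [300, 320, 340]
```

Passing `factor` as an argument instead of relying on a global would avoid the failure altogether.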


### Classes 
Where do objects come from?

- From a collection of pre-defined attributes and methods $-$ this is called a `class`.
- We can make our own. It is complex, but powerful.
- Examples of classes/objects we will see: 
    - Dataframes, scraping tools, machine learning models etc.
    - (everything is an object)
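
As a rough illustration of attributes and methods, here is a minimal sketch of a home-made class. The class `Survey` and its contents are hypothetical and not part of the course material:

```python
class Survey:
    """Hypothetical class holding the answers to one survey question."""

    def __init__(self, question, answers):
        # Attributes: data stored on each object
        self.question = question
        self.answers = list(answers)

    def mean(self):
        # Method: a function attached to the object
        return sum(self.answers) / len(self.answers)


# An object is one instance of the class
trust = Survey('Trust in others (0-10)', [6, 8, 7])
trust_mean = trust.mean()

print(type(trust).__name__)  # Survey
print(trust_mean)            # 7.0

# Built-in values are objects too
print(isinstance(3, object), isinstance('a', object))  # True True
```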

## Copy vs. View

Important: when writing `A = B`, `A` is only a reference to the same object as `B`!

- In other words: `A` is a **view** of `B`
- Implication 
    - if the object is mutable, e.g. a list or dataframe: changes made through `B` show up in `A` and vice versa 

We can break this dependency by explicitly making a **copy**. 
- For instance, in pandas use `A = B.copy()`
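
A minimal sketch of the same behavior using plain Python lists, so that it runs without pandas. The variable names are arbitrary:

```python
import copy

B = [1, 2, 3]
A = B          # A and B are two names for the same list
A.append(4)    # mutating through A ...
print(B)       # ... is visible through B: [1, 2, 3, 4]
print(A is B)  # True

C = B.copy()   # a new list with the same elements
C.append(5)
print(B)       # unchanged: [1, 2, 3, 4]

# A shallow copy still shares nested mutable objects
nested = [[1, 2], [3, 4]]
shallow = nested.copy()
deep = copy.deepcopy(nested)
nested[0].append(99)
print(shallow[0])  # [1, 2, 99]
print(deep[0])     # [1, 2]
```

For nested structures, `copy.deepcopy` is the safer choice. In pandas, `DataFrame.copy()` makes a deep copy by default.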



## Coding That is Fast

### One-liners using comprehensions

A very compact way of writing code: *comprehension*. Example of list comprehension:
```python
new_list = [my_proc(a) for a in my_list] 
```

The new thing is that we define the loop inside the list! 


```python
# Define function
def my_proc(a):
    return a**2

# Generate a list
my_list = []
for i in range(10):
    my_list.append(i+1)

# Use line from before to generate transformation 
new_list = [my_proc(a) for a in my_list]

# Print lists
print('my_list:', my_list)
print('new_list:', new_list)
```

```
my_list: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
new_list: [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
```
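
Comprehensions can also filter elements and build dictionaries or sets. A short sketch of these variants, with variable names made up for the example:

```python
my_list = list(range(1, 11))

# Keep only the even numbers, then square them
even_squares = [a**2 for a in my_list if a % 2 == 0]

# Dictionary comprehension: map each number to its square
square_lookup = {a: a**2 for a in my_list}

# Set comprehension: unique remainders after division by 3
remainders = {a % 3 for a in my_list}

print(even_squares)      # [4, 16, 36, 64, 100]
print(square_lookup[7])  # 49
print(remainders)        # {0, 1, 2}
```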


### The Need for Speed
Python is elegant and simple

However, Python is NOT built for speed. 

We can compensate by using smart packages that have fast algorithms, e.g. numpy and pandas


### Vectorization

An alternative is to write our code in terms of numpy arrays.

In the example below, we generate a long list by making some transformation of an input... we get around a 30 times speed-up by using numpy!

```python
%%timeit 
[i+3 for i in range(10**7)]
```

```
1.07 s ± 98.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

```python
import numpy as np
```

```python
%%timeit
np.arange(10**7)+3
```

```
32.2 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
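
The `%%timeit` magic only works in Jupyter/IPython. In a plain Python script, a similar comparison can be sketched with the standard library's `timeit` module. The helper names and the smaller input size below are assumptions made so the sketch runs quickly, and the timings will differ between machines:

```python
import timeit

import numpy as np

n = 10**5  # smaller than in the notebook cells so the sketch runs quickly


def with_list():
    # Pure Python: loop over every element
    return [i + 3 for i in range(n)]


def with_numpy():
    # Vectorized: the addition is applied to the whole array at once
    return np.arange(n) + 3


# Both approaches give the same numbers
same_result = with_list() == with_numpy().tolist()

# timeit.timeit returns the total time in seconds for `number` calls
runs = 20
t_list = timeit.timeit(with_list, number=runs)
t_numpy = timeit.timeit(with_numpy, number=runs)

print('same result:', same_result)
print(f'list comprehension: {t_list / runs * 1000:.2f} ms per call')
print(f'numpy:              {t_numpy / runs * 1000:.2f} ms per call')
```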


### Take Away on Speed

Use pandas or numpy - they are optimized for speed! 

Why are pandas and numpy fast? 

- Because they are written in fast, low-level code with optimized algorithms! 

Why are we not learning low level? 
- Requires many lines of code for simple operations - not an efficient use of your time!
- Too steep learning curve (not like Python!!)

# Power tools: git and markdown

## Git, a non-technical overview

Git is a command-line tool that provides:

1) "Track changes" system for files
- A log of all changes is kept - from nothing to current version
- All changes are explicitly declared by you, and you may annotate them
  - You can try out things, but only save meaningful changes!

2) Share the files you want, how you want 
- A git folder, called **repository**, can be copied by others
- Many sites allow public and private repositories - you decide access
    


Not covered explicitly in course. Learn to use point-and-click version for fetching files.

## Markdown

[LaTeX](https://en.wikipedia.org/wiki/LaTeX) can create beautiful scientific documents.

Problem - the underlying code is heavy to read. Is there an alternative?

Yes: markdown. Like Python, it keeps code simple. Example:
        
Making italic text:
- markdown: `*Some text*`
- LaTeX: `\textit{Some text}`

How do you learn it? Open our notebook cells or see the tutorial in the reading list

# Outro

- Coding is tough, but worth learning
    - Also for social scientists
- You can dig deeper into
   - Advanced Python
   - Git for saving and sharing code
   
- Next lecture this afternoon: Strings, queries and APIs