# Python Bootcamp (in-class)

## Learning Objectives

Students will be able to: 
* Locally launch Jupyter Notebook and Spyder
* Know when it's useful to use Google Colab in the cloud
* Understand basic programming concepts and their applications to collecting web data

<div class="alert alert-block alert-info"><b>Support Needed?</b> 
    For technical issues outside of scheduled classes, please check the <a href="https://odcm.hannesdatta.com/docs/course/support" target="_blank">support section</a> on the course website.
</div>

------

## 1. Welcome to Python!

### 1.1 Why we use Python...
1. Multi-purpose (web server, web scraping, automating, machine learning)
2. High-level, relatively easy to learn
3. Great documentation
4. Open source / free
5. Widely used in business/data science
6. Platform independent

### 1.2 Differences between Python, Anaconda, Jupyter Notebook, and the cloud...

- Python vs. "Notebooks" (e.g., running code)
- Self-installed Python distribution vs. Anaconda (e.g., getting it "right")
- Local vs. cloud setups (e.g., installing packages)

-------------

## 2. Launching Python and getting to know the interface

### 2.1 Launching Jupyter Notebook locally

- Anaconda Navigator
- Command prompt/terminal
- Why Jupyter Notebook runs in your browser (and also in your terminal)
- Finding and opening `.ipynb` files
- Closing Jupyter Notebook

<div class="alert alert-block alert-info"><b>Tip:</b> 
The terms command prompt (Windows) and Terminal (Mac, Linux) are used interchangeably.
</div>


__Exercise 2.1__

Download this notebook from the course website as a `.ipynb` file, save it on your computer, launch Jupyter Notebook on your computer, and open the file.


### 2.2 Getting to know the Jupyter Notebook interface
- Code vs. markdown cells
- Running cells
- Markdown highlighting


__Exercise 2.2__

- Open a new Jupyter Notebook and create content.
- First, add a markdown cell, in which you format "Exercises" as a first-order title using markdown (using `#`), followed by your name and email address as regular text.
- Second, add a code cell, in which you type `message = "Hello world"`.
- Third, add another code cell, in which you type `print(message)`.
- Run all cells.
- Save the notebook as `my_exercise.ipynb`.

### 2.3 Launching Google Colab

- Discuss benefits and drawbacks of cloud-based Jupyter Notebooks (e.g., use of `selenium` and `Chrome`)
- [Launching Google Colab](https://colab.research.google.com)
- Connecting Google Colab to your Google Drive

__Exercise 2.3__
- Open `my_exercise.ipynb` in Google Colab, run it, and create a collaborative sharing link.


### 2.4 Launching a Python editor (e.g., Spyder)
- Source code (vs. code cells)
- Comments (vs. markdown cells)
- File extensions (`.ipynb` vs. `.py`)
- Alternative editors (e.g., VS Code)
- Launching `.py` files "in production" from the terminal: `python your_sourcecode.py`

__Exercise 2.4__

- Create a file called `code.py` with the following content:
    
    ```
    message = "Hello world"
    print(message)
    ```

- Run the file from the terminal.

### 2.5 Installing packages and using help on the web

- Installing packages __via the terminal__: `pip install <packagename>`
- Importing packages into Python __via Python__: `import <packagename>`
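
To see how the two steps relate, you can try importing a package and check whether Python finds it. A minimal sketch, assuming you want to check for `requests` (the package used later in this notebook); the `status` variable is only there for illustration:

```python
# Try to import a package; Python raises ImportError if it is not installed.
try:
    import requests
    status = 'requests is installed, version ' + requests.__version__
except ImportError:
    status = 'requests is not installed yet; run "pip install requests" in the terminal'

print(status)
```

If the import fails, install the package in the terminal first, and then run the cell again.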

__Exercise 2.5__

Pandas is a really popular package for working with data in Python. Can you quickly search the web for *how to install it*, and then actually install it on your computer and test whether it runs?


<div class="alert alert-block alert-info"><b>Tip:</b> 
I frequently use <a href="https://stackoverflow.com/questions/">Stack Overflow</a> when I'm stuck with Python.
</div>



### 2.6 Why Jupyter Notebook sucks (but still we use it...)

- Danger of point-and-click (and benefits of top-down execution)
- High overhead (vs. lean `.py` files)
- Support of advanced tutorials and packages (limited `selenium` support)

<div class="alert alert-block alert-info"><b>Tip:</b> 
You can mimic top-down execution in Jupyter Notebook by restarting the kernel (Kernel --> Restart), and then executing all cells (Cell --> Run All).
</div>

Ultimately, the benefits of Jupyter Notebook for education outweigh the drawbacks.

---------

## 3. Coding concepts for web data



### 3.1 The web data workflow

1. Select data sources
2. Design data collection
    - Import data from the web into Python ("how to import web data into Python?")
    - Select relevant data from raw HTML files or the output of APIs ("how to select and filter data in Python?")
    - Store data in tables or databases ("how to store data using Python?")
3. Execute data collection
    - Schedule the data extraction and monitor its health ("how to schedule Python scripts?")
    
    
<div class="alert alert-block alert-info"><b>Tip:</b> 
Curious about the "real" web data workflow (which is much more comprehensive than what is shown here)? Start getting familiar with it early on by reading <a href="https://journals.sagepub.com/doi/10.1177/00222429221100750">"Fields of Gold"</a> (you've got to know this paper inside out by the end of this course...).
</div>



### 3.2 Variable types: Strings and Numbers
- strings (`message = 'Hello world!'` and `message2 = "This is a tutorial!"`) vs. numbers (`age = 25`)
- joining/concatenating strings (`message + message2`)
- calculating with numbers (`age + 1`)
- joining strings and numbers (+ conversion) (`message + str(age)`)
- printing numbers and strings
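
The snippets above can be combined into one cell. This is a small sketch using the same variable names; the extra `' '` (a space) between the strings is an addition to make the output readable:

```python
message = 'Hello world!'
message2 = "This is a tutorial!"
age = 25

# Joining (concatenating) two strings with +
combined = message + ' ' + message2
print(combined)

# Calculating with numbers
print(age + 1)

# A number must be converted to a string before joining it to a string
labelled = message + ' ' + str(age)
print(labelled)

# print() also accepts several values of different types, separated by commas
print(message, age)
```

Leaving out `str()` in `message + str(age)` raises a `TypeError`, because Python does not join strings and numbers automatically.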

__Exercise 3.1__

Please write some Python code that stores your name in a variable called `name`, and your age in a variable called `age`. Then, print the following to the screen: "My name is <YOUR NAME> and I am <AGE> years old.".

### 3.3 Crawling data from `reddit.com`



__Exercise #3.1__

- Open Reddit.com and find the "University" subreddit - browse it.
- Copy the following code to Jupyter Notebook, and execute it.
- Then, change the subreddit to one of your choice (i.e., search for something you find interesting on [Reddit.com](https://reddit.com)), and rerun the cell.


```python
import requests
import json
subreddit = 'University'
url = 'https://www.reddit.com/r/' + subreddit + '/about.json'
content = requests.get(url, headers = {'User-agent': 'I am learning Python.'}).json()
print('Getting data from...', url)
print('Subreddit name:', content['data']['display_name'])
print('Subreddit title:', content['data']['title'])
```

```
Getting data from... https://www.reddit.com/r/University/about.json
Subreddit name: University
Subreddit title: University: academic and real-world news for students, faculty, and academics
```
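
The `.json()` call turns the API response into nested Python dictionaries, which is why two keys in a row are needed to reach a value. Below is a sketch with a hand-made, heavily shortened stand-in for the response (the real output contains many more fields, and no web request is made here):

```python
# Hand-made stand-in for the API response (illustration only)
content = {
    'data': {
        'display_name': 'University',
        'title': 'University: academic and real-world news for students, faculty, and academics',
    },
}

# First select the inner dictionary stored under 'data' ...
inner = content['data']
# ... then select a value from that inner dictionary
print(inner['display_name'])

# Both steps can be chained in one expression
print(content['data']['title'])
```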


### 3.4 Reusing code with functions

- Writing functions can drastically simplify code execution (avoiding copy-paste errors)
- Functions start with `def`, and can (but need not) have inputs ("arguments") and outputs ("returns")
- Functions require a "hierarchy", as visualized with indents (same number of spaces or a tab)

    
<div class="alert alert-block alert-info"><b>Tip:</b> 
Most novices don't get the indents right. Especially when copy-pasting code from the web, you can end up with an inconsistent mix of spaces and tabs, while Python requires every line in a block to be indented the same way (the common convention is four spaces per level). Be aware of this pitfall when coding!
</div>
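
The parts of a function can be seen in a small example. This is a sketch only; `build_url` is a made-up helper for illustration and not part of any package:

```python
# 'def' starts the function; 'subreddit' is its input (argument)
def build_url(subreddit):
    # Indented lines belong to the function body (4 spaces each)
    url = 'https://www.reddit.com/r/' + subreddit + '/about.json'
    # 'return' hands the result (output) back to the caller
    return url

# This line is not indented, so it is outside the function
print(build_url('University'))
```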

__Exercise 3.2__

- Copy the following function into your Jupyter Notebook, and run the cell.

```python
import requests  # needed if you have not imported it earlier in the notebook

def get_reddit_data(subreddit):
    url = 'https://www.reddit.com/r/' + subreddit + '/about.json'
    content = requests.get(url, headers = {'User-agent': 'I am learning Python.'}).json()
    print('Getting data from...', url)
    print('Subreddit name:', content['data']['display_name'])
    print('Subreddit title:', content['data']['title'])
```

- Then, run the function, by writing `get_reddit_data("university")` in a new cell and running it. Does it work?
- Finally, call this function for three subreddits of your choice (and write the code for it & run it!).
- What are the inputs and outputs of the `get_reddit_data` function?

### 3.5 Returning and saving data

- using `return` in a function (simplify to only reddit's `display_name`)
- saving data in a variable
- choosing variable names
- the dictionary data type (attribute-value pairs, `{'name': 'student', 'age': 25}`)
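
The points above can be illustrated without any web request. In this sketch, `describe_student` and `student_info` are made-up names chosen for illustration:

```python
def describe_student(name, age):
    # Build a dictionary of attribute-value pairs and hand it back to the caller
    return {'name': name, 'age': age}

# Save the returned dictionary in a variable with a descriptive name
student_info = describe_student('student', 25)

print(student_info)
print(student_info['name'])  # look up a single value by its key
```

Unlike `print`, which only shows a value on the screen, `return` makes the value available for later use, for example to store it in a file.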

__Exercise 3.3__

Please modify the `get_reddit_data` function to return a dictionary, holding the `display_name` and `title`, and store the dictionary in a variable called `output`.

### 3.6 Arrays and looping
- In web data, you typically want to get data on MORE than one page/endpoint (e.g., subreddit).
- Copy-pasting -- even simple function calls -- is to be avoided (copy-paste mistakes, code complexity)
- Arrays (called "lists" in Python) can hold multiple records of information, e.g., subreddits `subreddits = ['university','marketing','sports']`
- The process of "iterating" through an array is called a LOOP. Combining these two makes a powerful pair!!!

```python
subreddits = ['university','marketing','sports']

for item in subreddits:
    print(item)
```

```
university
marketing
sports
```
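
A loop can also do some work on each item and collect the results. A minimal sketch that builds one URL per subreddit and stores them in a new list (the name `urls` is chosen for illustration):

```python
subreddits = ['university', 'marketing', 'sports']

urls = []  # start with an empty list
for item in subreddits:
    # Build the URL for the current subreddit and add it to the list
    urls.append('https://www.reddit.com/r/' + item + '/about.json')

print(len(urls))
print(urls[0])  # indexing starts at 0
```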


__Exercise 3.4__

- Modify the code snippet above to repeatedly execute the `get_reddit_data()` function on the subreddits.
- Test the function --- does it work? Where does the output go?


### 3.7 Saving data in "flat files"

- The file writing process (open, write, close)
- Encoding
- Different writing modes (`'a'` for appending, `'w'` for creating new/overwriting existing file)
- Choosing a file extension

```python
f = open('filename.json', 'w', encoding = 'utf-8') # open new file for writing
f.write('Hello world!\n') # write content and NEW LINE character to file ('\n')
f.close() # close file
```
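
Many Python programmers write the same open-write-close steps with a `with` statement, which closes the file automatically. A sketch of that alternative, which also reads the file back to check what was written (the variable name `file_content` is chosen for illustration):

```python
# 'with' closes the file automatically once the indented block is done
with open('filename.json', 'w', encoding='utf-8') as f:
    f.write('Hello world!\n')

# Open the same file for reading ('r') to check what was written
with open('filename.json', 'r', encoding='utf-8') as f:
    file_content = f.read()

print(file_content)
```

Because the file is opened with `'w'`, running this cell again overwrites the file rather than adding to it.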

__Exercise 3.5__

Run the code (`f = open`... from the cell above) in Jupyter Notebook. Then, create a NEW cell below it, and write some code that appends `"I am working on a tutorial"` to this file.


### 3.8 Storing dictionaries in "flat files"

- Dictionaries as new-line separated JSON files
- Converting dictionaries to flat-files: the `json` library
- Where to load packages in a script?

```python
my_datapoint = {'name': 'student', 'age': 25}

import json

json.dumps(my_datapoint)
```

```
'{"name": "student", "age": 25}'
```
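
"New-line separated JSON" means one JSON string per line. The sketch below builds such text in memory for two dictionaries and converts the first line back with `json.loads` (the second data point is made up for illustration):

```python
import json

datapoints = [{'name': 'student', 'age': 25},
              {'name': 'teacher', 'age': 40}]

# Convert each dictionary to a JSON string and put one on each line
lines = ''
for datapoint in datapoints:
    lines = lines + json.dumps(datapoint) + '\n'
print(lines)

# json.loads() converts a JSON string back into a dictionary
first_line = lines.split('\n')[0]
restored = json.loads(first_line)
print(restored['age'])
```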

__Exercise 3.6__

Please write some code that *saves* the variable content of `my_datapoint` to a flat `json` file, called `my_data.json`.

### 3.9 Tying things together

Now it's your turn. Use the concepts from above to...

- Create an array, holding ten subreddit names of your choice
- Write a function that returns as a dictionary the following data points from the about page of a subreddit: `display_name`, `title`, `subscribers`, and the date of creation, `created`.
- Write a loop to retrieve data for the ten subreddits, and store the data in a new-line separated JSON file called `my_first_web_data.json`.

<div class="alert alert-block alert-info"><b>Tips:</b>
    
<ul>
    <li>Did you know you can "look" at the API output directly in Firefox or Chrome? Just open the URL that is called for a particular subreddit in your browser. Try it with <a href='https://www.reddit.com/r/University/about.json'>this one first (click)!</a></li>
  <li>You can use <code>f.write</code> multiple times in your code. To write a new line to the file, use <code>f.write('\n')</code>.</li>
    <li>Please pay attention to where you open the file for the first time, and how (<code>'a'</code> vs. <code>'w'</code>).</li>
  
</ul> 
 
</div>



__That's it! Good job!__