# Learning Lunch: Python for Natural Language Processing

# Lesson 1: The Basics 

## Introduction: Yay! Coding!

### This class (and hopefully others!) will be split into two parts. 

1. **Concept time** - Following along with pre-written code

    * Lesson 1: Imports, variables, datatypes, if/else, functions




2. **Project time** - The best way to learn coding? By building something cool!

    * Project 1: Band name generator

    We'll write a Python program to create a Band Name Generator (one of these bad boys)!

    ![Example](indie_band.png)

    ![Example](midwest_emo_band.png)

    A storytelling program is an alternative, though it could be more challenging.

    * I will include starter code
    
    * You can work collaboratively, solo, or skip it if it's not your style.

    * You can continue working on projects whenever you want, even if we don't finish today.
    

---

## 1.1 - Setup 

### Before we start:
* We will be working from *notebooks* where we can write and execute chunks of code.

* Conda will help keep our Python packages from getting messy.

* Check out `environment.yml` for more info!

    #### Notebook Tips (Mac):
    * Run cell: **Shift + Return**
    * New cell: **b**
    * Formatting: **Settings > Autoclose brackets**

### Imports

#### To save time, certain **packages** have already been installed. 

This means we can simply *import* the package to use it.

In [1]:
import pandas as pd

# What this means: Import the library "pandas" under an alias "pd"

What happens if we try to import a package that isn't already installed in our environment?

In [6]:
import polyglot

ModuleNotFoundError: No module named 'polyglot'

We get an error since `polyglot` is missing from the packages currently installed in our environment. 

To fix this, we can install the missing library with `!pip install <PACKAGE_NAME>`.


(**Note**: Since we're working in a Conda environment, it's best practice to install packages with `!conda install [OPTIONS] <PACKAGE_NAME>` unless `pip install` is absolutely necessary.)

[(Learn more about Conda)](https://docs.conda.io)

In [7]:
!pip install polyglot

Collecting polyglot
  Downloading polyglot-16.7.4.tar.gz (126 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 126.3/126.3 kB 1.0 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: polyglot
  Building wheel for polyglot (setup.py) ... done
  Created wheel for polyglot: filename=polyglot-16.7.4-py2.py3-none-any.whl size=52558 sha256=494eb2941f42c209319395a19968e5f142d3eb91c2557b808dbe13921ab7acf0
  Stored in directory: /Users/yunmichelle/Library/Caches/pip/wheels/c7/5e/28/47349211ec1f91379f41ed10bc2520f7071ecfb6cbe182f6fe
Successfully built polyglot
Installing collected packages: polyglot
Successfully installed polyglot-16.7.4


To check that the package was installed, we run the `import` statement again and see whether it raises an error.

In [17]:
import polyglot

It worked! Before we can move on to the fun stuff, let's import a few more things.
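
Re-running `import` works as a check, but a missing package stops the cell with a traceback. As a rough alternative, the sketch below uses the standard library's `importlib.util.find_spec` to ask whether a top-level package can be found without actually importing it. The helper name `is_installed` is made up for this example and is not part of the lesson's `scripts/helpers`.

```python
from importlib.util import find_spec


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be found in this environment."""
    # find_spec returns None when no top-level module with that name exists
    return find_spec(package_name) is not None


# "json" ships with Python, so it should always be found
print(is_installed("json"))
# A made-up name that should not exist in any environment
print(is_installed("surely_not_a_real_package_12345"))
```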

### Other Imports

In [4]:
import nltk
import re
import csv
import emoji
from collections import defaultdict
from scripts import helpers as h


**NLTK** is a popular and widely used tool for NLP with Python. 

In the cell below, we'll download the data packages used in the [NLTK book](https://www.nltk.org/book). The line is commented out, so remove the `#` to actually run the download.

In [5]:
#nltk.download('book')

---

## 1.2: Crash course on Python variables and data types 

We'll start by defining two variables:

1. `first_superbowl_winner`: The winning team of the first-ever Super Bowl, which capped the 1966 season (the game itself was played in January 1967). 

2. `superbowl_winner`: A Super Bowl-winning team.

In [6]:
### Don't do this:
# x = ""

### Do this:
first_superbowl_winner = "Green Bay Packers"
superbowl_winner = None

We know who won the first Super Bowl, but...

`superbowl_winner` can refer to 20 unique teams (at the time of writing), depending on the year. 

Since we don't have a definition for "year" yet, we will set the value of `superbowl_winner` to `None`.

---

Let's check the type of our variables using `formatted_type()`, a function I wrote to make the output a bit easier on the eyes and save us some time.

In [7]:
h.formatted_type(superbowl_winner)

Input: None
 
Data type: NoneType


In [8]:
h.formatted_type(first_superbowl_winner)

Input: Green Bay Packers
 
Data type: str


---
##### FYI:
Under the hood, `formatted_type`:
1. Takes an input
2. Checks its data type with the built-in function `type()`
3. Outputs the result with the built-in function `print()`
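
The real `formatted_type` lives in `scripts/helpers` and isn't shown here, so the following is only a guess at its shape, based on the three steps above and the printed output. The names `describe_type` and `formatted_type_sketch` are hypothetical.

```python
def describe_type(value) -> str:
    """Build the report: the input, then the name of its type."""
    type_name = type(value).__name__   # e.g. "str" or "NoneType"
    return f"Input: {value}\n \nData type: {type_name}"


def formatted_type_sketch(value) -> None:
    """Print the report for a value (a stand-in for h.formatted_type)."""
    print(describe_type(value))


formatted_type_sketch("Green Bay Packers")
formatted_type_sketch(None)
```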

---

### To recap...
* `NoneType` is the type of `None`, which in Python pretty much just means "no value". 

* `str` is short for "string": a sequence of characters surrounded by opening and closing quotation marks.

    * Even though they can get complicated, for today we can think of them as *text*: sentences, words, characters...

We work with strings a lot, so we'll go over some ways to slice and dice them in a bit.

### (Time permitting) Ok, so what?

One thing to keep in mind: `None` evaluates to the **boolean** `False` in an `if/else` statement.

Let's use the function below to see this in action.

In [9]:
def who_won(superbowl_winner):
    if superbowl_winner:    # if superbowl_winner has a (truthy) value
        print(f"{superbowl_winner} won the Super Bowl!") # <- do this
    else: # if it doesn't
        print("Wait, what year is it?") #<- do this instead

In [10]:
superbowl_winner = None 
who_won(superbowl_winner) # None is falsy, so the else branch runs

Wait, what year is it?


#### **Vibe check**: Let's make a small change by putting quotation marks around `None`. 

What will the output be now?

In [11]:
superbowl_winner = "None"
#who_won(superbowl_winner)

In [12]:
superbowl_winner_type = type(superbowl_winner).__name__
print(superbowl_winner_type)

str
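
For anyone who wants to double-check the vibe check: the built-in `bool()` shows how Python would treat a value in an `if` statement. The sketch below is an aside rather than part of the original lesson code, and the variable names are made up.

```python
# Values to test: the real None, the string "None", an empty string, a team name
candidates = [None, "None", "", "Green Bay Packers"]

# Key each result by repr() so that None and "None" are easy to tell apart
truthiness = {repr(value): bool(value) for value in candidates}

for label, is_truthy in truthiness.items():
    print(f"{label} -> {is_truthy}")
```

When the goal is specifically to detect a missing value, `if superbowl_winner is None:` is the more explicit check, since it does not lump empty strings in with `None`.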


### (end time permitting)

## 1.3 - Dictionaries & Functions

In [13]:
import typing
from scripts.helpers import Solutions

What is the datatype of `pastfive_sb_winners`?

Find it, then assign it to the variable `pastfive_sb_winners_type` (hint: use `type()`). 

In [17]:
pastfive_sb_winners = {
    "2023": "Kansas City Chiefs",
    "2022": "Los Angeles Rams",
    "2021": "Tampa Bay Buccaneers",
    "2020": "Kansas City Chiefs",
    "2019": "New England Patriots",
    "2018": "Philadelphia Eagles",
    "2017": "New England Patriots",
    "2016": "Denver Broncos"    
}

### Your code here
pastfive_sb_winners_type = ... # change this

## Throws an error if pastfive_sb_winners_type is incorrect

assert pastfive_sb_winners_type == Solutions.q1, "Try again"

---

### Python dictionaries represent key-value pairs.

* In `pastfive_sb_winners`, the **keys** are years and the **values** are team names.

* Each year maps to the team that won the Super Bowl played in that year.

**Note**: Dictionary keys must be unique, and they must be hashable, which in practice usually means immutable types such as strings, numbers, or tuples. 

* We'll go more into this later.

* For now, the most important thing to know is that you can extract a value from a `dict` using its key.

* We'll be doing this to build our Band Name Generator!
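
To make the note about keys concrete, here is a small sketch using a made-up `stock` dictionary (not part of the lesson data). It shows that assigning to an existing key overwrites the old value rather than creating a duplicate, and that a mutable object such as a list is rejected as a key.

```python
stock = {"apples": 3, "pears": 0}

# Assigning to an existing key replaces its value; no duplicate key is created
stock["pears"] = 5
print(stock)

# Lists are mutable (and therefore unhashable), so Python refuses them as keys
try:
    stock[["plums", "figs"]] = 2
    list_key_error = None
except TypeError as err:
    list_key_error = str(err)
    print(f"TypeError: {list_key_error}")

# A tuple is immutable, so it works as a key
stock[("plums", "figs")] = 2
print(len(stock))
```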



In [15]:
year = "2023"
pastfive_sb_winners[year]

# If the key doesn't exist???
# pastfive_sb_winners["2025"]


'Kansas City Chiefs'
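
As for the commented-out question in the cell above: looking up a key that isn't in the dictionary raises a `KeyError`. A minimal sketch with a shortened, made-up copy of the dictionary (`recent_winners`), showing both the error and `dict.get`, which returns a fallback value instead of raising:

```python
recent_winners = {"2023": "Kansas City Chiefs", "2022": "Los Angeles Rams"}

# Square-bracket lookup raises KeyError when the key is missing
try:
    winner = recent_winners["2025"]
except KeyError:
    winner = None
    print("No entry for 2025")

# dict.get returns the given default instead of raising
fallback = recent_winners.get("2025", "unknown")
print(fallback)
```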

Who won the Super Bowl in 2017? Assign the answer to `sb_winner_2017` and run the cell to check.

In [18]:
## Your code here
sb_winner_2017 = ...

assert sb_winner_2017 == Solutions.q2, "Try again"

#### Question: What else could be represented by a `dict`?
- What you have in your fridge? [y/n]

- Countries you have and haven't travelled to? [y/n]

---

### Functions! 

#### A very basic Python function:

In [19]:
def say_hi(name: str) -> str:    
    '''
    Returns a greeting of the form "Hi, <name>!".
    '''
    greeting = f"Hi, {name}!"
    return greeting

#### Let's walk through it line-by-line!

* Line 1: `def say_hi(name: str) -> str:`

    * (Most) functions start with the keyword `def`, short for "define".

    * `say_hi` is the function name. 

    * `(name: str) -> str` - `name` is the parameter and `str` is its expected type. The type after the arrow `->` indicates the expected type of the value the function returns.

        * This is called *type hinting* - while it's not mandatory, it is good practice to avoid problems when your code gets more complex.
        
* Lines 2-4: 

    * This is a docstring: a bit of text that tells others (and sometimes yourself) what the function is meant to do.


* Line 5: `greeting = f"Hi, {name}!"`

    * This is the function body, where the magic happens.
    
    * In this case, it uses an *f-string* (more on this later) to format a greeting (`"Hi, <name>!"`).

* Line 6: `return greeting`

    * The `return` keyword means the function is actually outputting a value - something we can store and manipulate.

    *  Question: what if we changed `return` to `print()`?
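
One way to explore that last question is to write both versions side by side. In the sketch below, `say_hi` is based on the cell above and `print_hi` is a hypothetical variant that prints instead of returning. The final line also shows that type hints are not enforced when the code runs.

```python
def say_hi(name: str) -> str:
    """Return a greeting for the given name."""
    greeting = f"Hi, {name}!"
    return greeting


def print_hi(name: str) -> None:
    """Print a greeting; nothing is handed back to the caller."""
    print(f"Hi, {name}!")


returned = say_hi("Ada")    # the string is stored and can be reused
printed = print_hi("Ada")   # the text is displayed, but printed is None

print(returned.upper())
print(printed)

# Type hints are documentation: passing an int still runs without an error
print(say_hi(42))
```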


---

### Functions: Your turn!

We now have enough information to write a piece of code that:

* Takes an input `year` (Question: are there any constraints to consider?)

* Returns the value for `superbowl_winner` of that `year` using `pastfive_sb_winners`



In other words, we're mapping a parameter *year* (what goes in) to a value *superbowl_winner* (what comes out).

* That's pretty much the gist of what a function does!


In [20]:
# Our dictionary
SB_DICT = pastfive_sb_winners
print(SB_DICT)

{'2023': 'Kansas City Chiefs', '2022': 'Los Angeles Rams', '2021': 'Tampa Bay Buccaneers', '2020': 'Kansas City Chiefs', '2019': 'New England Patriots', '2018': 'Philadelphia Eagles', '2017': 'New England Patriots', '2016': 'Denver Broncos'}


In [None]:

def who_won_the_sb(year:...) -> str:    # what is the expected dtype of year?
    ### Your code here
    pass
    


## 1.4 - String Methods & Project Time
There are also methods that only work on `str` objects. We'll go over some now.
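
As a preview, here are a few built-in `str` methods applied to a sample team name. The variable names are made up for this sketch, and the methods shown are only a small selection.

```python
team = "green bay packers"

shouted = team.upper()                    # all caps
title_cased = team.title()                # capitalize each word
words = team.split()                      # split on whitespace into a list
swapped = team.replace("green", "blue")   # swap one substring for another

print(shouted)
print(title_cased)
print(words)
print(swapped)

# Strings are immutable: each method returned a new value, team is unchanged
print(team)
```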