# Learning Lunch: Python for Natural Language Processing

# Lesson 1: The Basics 

## Introduction: Yay! Coding!

### This class (and hopefully others!) will be split into two parts. 

1. **Concept time** - Following along with pre-written code

    * Lesson 1: Imports, variables, datatypes, if/else, functions




2. **Project time** - The best way to learn coding? By building something cool!

    * Project 1: Band name generator

    We'll write a Python program to create a Band Name Generator (one of these bad boys)!

    ![Example](indie_band.png)

    ![Example](midwest_emo_band.png)

    

---
**Notes:**
* I will include starter code
    
* You can work collaboratively, solo, or skip it if it's not your style.

* You can continue working on projects whenever you want, even if we don't finish today.

---

## 1.1 - Setup 

### Creating a virtual environment

* Open Terminal (I prefer iTerm, but either is fine) and run the following commands. 


    ```bash
    cd downloads/python_learning_lunches 
    conda env create -f environment.yml
    conda activate python-ll-env
    ```

* You should now see ```(python-ll-env)``` before your username.

---
**Notes:**

* We are using **Conda** to install and manage Python packages. More details [here](https://docs.conda.io).

* `environment.yml` is our *environment file*. Learn more about Conda environment files [here](https://carpentries-incubator.github.io/introduction-to-conda-for-data-scientists/04-sharing-environments/index.html).

---

### Notebook Shortcuts (Mac)
* Run cell: **Shift + Return**
* New cell: **b**
* Formatting: **Settings > Autoclose brackets**

##### Run the cell below, selecting `python-ll-env` as your environment if prompted:


In [1]:
print("Environment setup complete!")

Environment setup complete!


### Imports

#### For convenience, certain **packages** have already been installed. 

This means we can simply *import* the package to use it.

In [1]:
import pandas as pd

# What this means: Import the library "pandas" under an alias "pd"

What happens if we try to import a package that isn't already installed in our environment?

In [1]:
import polyglot

ModuleNotFoundError: No module named 'polyglot'

We get an error since `polyglot` is missing from the packages currently installed in our environment. 

To fix this, we can install the missing library with `!pip install <PACKAGE_NAME>`.


In [2]:
!pip install polyglot

Collecting polyglot
  Using cached polyglot-16.7.4-py2.py3-none-any.whl
Installing collected packages: polyglot
Successfully installed polyglot-16.7.4


---
**Note:** 

* Since we're working in a Conda environment, it's best practice to install packages with `!conda install [OPTIONS] <PACKAGE_NAME>` unless `pip install` is absolutely necessary. Pip is a bit faster, so I'm using it to demonstrate.

---

To check that the package was installed, we run the `import` statement again and see whether it still raises an error.

In [3]:
import polyglot

It worked! Before we can move on to the fun stuff, let's import a few more things.
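
As an aside, re-running `import` is not the only way to check for a package. Here is a minimal sketch that asks Python whether a module can be found without actually importing it. The helper name `is_installed` is made up for this example, and the second package name is deliberately fake.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if `package_name` can be imported in the current environment."""
    # find_spec returns None when no top-level module with that name is found
    return importlib.util.find_spec(package_name) is not None


print(is_installed("json"))                               # part of the standard library
print(is_installed("definitely_not_a_real_package_123"))  # made-up name
```

The first call should print `True` and the second `False`. Unlike a bare `import`, this never raises `ModuleNotFoundError`, so it is handy when you only want a yes/no answer.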

### Other Imports

In [5]:

import re
from utils.functions import *


---

## 1.2: Crash course on Python variables and data types 

We'll start by defining two variables:

1. `first_superbowl_winner`: The winning team of the first ever Super Bowl, played in January 1967 (at the end of the 1966 season). 

2. `superbowl_winner`: A Super Bowl-winning team.

In [7]:
### Don't do this:
# x = ""

### Do this:
first_superbowl_winner = "Green Bay Packers"
superbowl_winner = None

The first Super Bowl, played in January 1967 to cap the 1966 season, was won by the Green Bay Packers. 

However...

`superbowl_winner` can refer to 20 unique teams (at the time of writing), depending on the year. 

Since we don't have a definition for "year" yet, let's assign a value of `None` to `superbowl_winner` for now.

---

### Data types

(It's a good idea to familiarize yourself with Python's built-in data types - [the official documentation](https://docs.python.org/3/library/stdtypes.html) is a good place to start.)

Let's check the `type` of our variables using `formatted_type()`, a function I wrote to make it a bit easier on the eyes & save us some time.

In [8]:
formatted_type(superbowl_winner)

Input: None
 
Data type: NoneType


In [9]:
formatted_type(first_superbowl_winner)

Input: Green Bay Packers
 
Data type: str


---
**FYI:**

Under the hood, `formatted_type`:
1. Takes an input 
2. Checks its data type with the built-in function `type()`
3. Outputs the result with the built-in function `print()`

---
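
The three steps above can be sketched roughly like this. This is not the actual code in `utils.functions`; the names `describe_type` and `formatted_type_sketch` are invented for illustration, and the output layout is only an approximation of what we saw above.

```python
def describe_type(value) -> str:
    """Build a short, readable report of a value and its data type."""
    # type(value) gives the class; __name__ turns it into plain text like "str"
    type_name = type(value).__name__
    return f"Input: {value}\n\nData type: {type_name}"


def formatted_type_sketch(value) -> None:
    """Print the report (a hypothetical stand-in for the course helper)."""
    print(describe_type(value))


formatted_type_sketch("Green Bay Packers")
formatted_type_sketch(None)
```

Splitting the work into one function that builds the text and one that prints it is a design choice for this sketch: the text-building part is easier to reuse and test.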

### To recap...
* `NoneType` in Python pretty much just means "no value". 

* `str` is short for `string`: a string of characters surrounded by opening and closing quotation marks.

    * Even though they can get complicated, for today we can think of them as *text*: sentences, words, characters...

* We work with strings a lot, so we'll go over some ways to slice and dice them in a bit.

### (Time permitting) Ok, so what?

One thing to keep in mind: `None` evaluates to the **boolean** `False` in an `if/else` statement.

Let's use the function below to see this in action.

In [9]:
def who_won(superbowl_winner):
    if superbowl_winner:    # if superbowl_winner has a (truthy) value
        print(f"{superbowl_winner} won the super bowl!") # <- do this
    else: # if it doesn't
        print("Wait, what year is it?") # <- do this instead

In [10]:
superbowl_winner = None 
who_won(superbowl_winner) # None counts as False, so the else branch runs

Wait, what year is it?


#### **Vibe check**: Let's make a small change by putting quotation marks around `None`. 

What will the output be now?

In [11]:
superbowl_winner = "None"
#who_won(superbowl_winner)

In [10]:
superbowl_winner_type = type(superbowl_winner).__name__
print(superbowl_winner_type)
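
If you want to see how Python treats a value inside an `if` without writing the `if`, the built-in `bool()` does the same conversion. Below is a small sketch; the list of values is just a sample I picked, not anything from the course files.

```python
# A few sample values; note that None and the string "None" are different things
values_to_check = [None, "None", "", 0, [], "Green Bay Packers"]

# Map a printable label for each value to the result of bool(value)
truthiness = {repr(value): bool(value) for value in values_to_check}

for label, is_truthy in truthiness.items():
    print(f"{label:>20} -> {is_truthy}")
```

Empty things (`None`, `""`, `0`, `[]`) come out as `False`, while any non-empty string comes out as `True`, even one that spells out "None".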

### (End time permitting)

## 1.3 - Strings

#### Strings are sequences - *strings* - of character data enclosed by either double or single quotes. 


In [51]:
# Takes user input, saves as variable
MYNAME = input("My name is: ")

In [46]:

#Concatenation
print("My name is " + MYNAME)

#F-string
print(f"My name is {MYNAME}")

#Slicing
print("The first letter of my name is " + MYNAME[0])

print("My name backwards is " + MYNAME[::-1])

#Length
print(f"My name has {len(MYNAME)} letters")

My name is michelle
My name is michelle
The first letter of my name is m
My name backwards is ellehcim
My name has 8 letters
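
Slicing follows the pattern `text[start:stop:step]`, where `stop` is not included. Here is a short sketch with a few more variations, using a hard-coded example name instead of `input()` so it runs on its own.

```python
name = "michelle"   # example value; any string works

first_three = name[:3]     # characters at index 0, 1 and 2
last_letter = name[-1]     # negative indices count from the end
middle = name[2:5]         # index 2 up to (but not including) index 5
every_other = name[::2]    # step of 2: every second character

print(first_three)   # mic
print(last_letter)   # e
print(middle)        # che
print(every_other)   # mcel
```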


--- 
**Note:**

There are many other useful built-in string methods. Check out more examples [here](https://www.pythoncheatsheet.org/cheatsheet/manipulating-strings).

---
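
To give a taste of those built-in methods, here is a small sketch with a handful of common ones. The band name is made up for the example.

```python
band = "  the rolling scones  "   # made-up band name with stray spaces

cleaned = band.strip()                          # remove leading/trailing whitespace
title_cased = cleaned.title()                   # capitalize the first letter of each word
words = cleaned.split()                         # split on whitespace into a list of words
swapped = cleaned.replace("scones", "stones")   # swap one substring for another

print(cleaned)       # the rolling scones
print(title_cased)   # The Rolling Scones
print(words)         # ['the', 'rolling', 'scones']
print(swapped)       # the rolling stones
```

Notice that every method hands back a new string (or list) and leaves `band` itself untouched.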

#### Strings are immutable, meaning they cannot be changed once created. 

In [48]:
MYNAME[0] = "M"

TypeError: 'str' object does not support item assignment

In [49]:
MYNAME = "M" + MYNAME[1:]
print(MYNAME)

# Built-in method for this as well
print(f"My name is {MYNAME.capitalize()}")

Michelle
My name is Michelle


In [50]:
# Regex
vowels = re.compile("[aeiou]")
print(f"My name without vowels is {re.sub(vowels, '', MYNAME)}")

My name without vowels is Mchll


### Converting data types to strings

In [33]:
def birth_month(month):
    return "My birth month is " + month

In [34]:
MONTH = 1

birth_month(MONTH)

TypeError: can only concatenate str (not "int") to str

In [36]:
# Cast to string 

MONTH = str(MONTH)
birth_month(MONTH)

'My birth month is 1'
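
Another option, sketched below, is to let an f-string do the conversion. The function name `birth_month_fstring` is invented here so it doesn't overwrite `birth_month` from the cell above.

```python
def birth_month_fstring(month) -> str:
    """Format the sentence with an f-string, which turns `month` into text for us."""
    return f"My birth month is {month}"


print(birth_month_fstring(1))          # My birth month is 1
print(birth_month_fstring("January"))  # My birth month is January
```

With `+`, both sides have to be strings already. Inside the curly braces of an f-string, Python converts the value to text automatically, so numbers and strings both work.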

### An Aside: Functions

#### A very basic function:

In [57]:
def say_hi(name: str) -> str:    
    '''
    Builds a greeting of the form "Hi, <name>!".
    '''
    greeting = f"Hi, {name}!"
    return greeting

#### Let's walk through it line-by-line!

* Line 1: `def say_hi(name: str) -> str:`

    * (Most) functions start with the keyword `def`, short for "define".

    * `say_hi` is the function name. 

    * `(name: str) -> str` - `name` is the parameter and `str` is its expected type. The type after the arrow `->` indicates the expected type of the return value.

        * This is called *type hinting* - while it's not mandatory, it is good practice to avoid problems when your code gets more complex.
        
* Lines 2-4: 

    * This is a docstring: a bit of text that tells others (and sometimes yourself) what the function is meant to do.


* Line 5: `greeting = f"Hi, {name}!"`

    * This is the function body, where the magic happens.
    
    * In this case, it uses an *f-string* (the same formatting we used on `MYNAME` earlier) to format a greeting ("Hi, <name>!")

* Line 6: `return greeting`

    * The `return` keyword means the function is actually outputting a value - something we can store and manipulate.

    *  Question: what if we changed `return` to `print()`?
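
One way to explore that last question is to put the two versions side by side. This is only a sketch; `say_hi_return` and `say_hi_print` are names made up for the comparison.

```python
def say_hi_return(name: str) -> str:
    """Hand the greeting back to the caller."""
    return f"Hi, {name}!"


def say_hi_print(name: str) -> None:
    """Show the greeting on screen, but hand nothing back."""
    print(f"Hi, {name}!")


returned = say_hi_return("michelle")
printed = say_hi_print("michelle")   # prints the greeting, then gives back None

print(returned.upper())   # HI, MICHELLE!
print(printed is None)    # True
```

Both versions put text on the screen at some point, but only the `return` version gives us a value we can store in a variable and keep working with.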


---

### Your turn! 

In [56]:
def greet_user(greeting: str) -> str:
    '''
    Prompts user for first and last names, then greets user by their full name. 
    
    params:
    ------
    greeting: str
    The greeting to use, e.g. 'What's up'.
    
    Example:
    --------

    fname = 'michelle'
    lname = 'yun'
    
    greet_user('What's up') should return 'What's up Michelle Yun'
    '''
    #Change these
    fname = ... 
    lname = ... 
    
    # Your code here
    
    

## 1.4 - Iterables & For Loops

Iterables are objects that you can iterate - or loop - through. Let's look at the most useful/common types.

### Lists

In [26]:
shopping_list = ["milk", "apples", "cereal", "juice"]

What can we do with lists?

In [25]:
"milk" in shopping_list

True

In [21]:
len(shopping_list)

4

In [22]:
shopping_list[2]

'cereal'
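
A few more list moves, as a quick sketch. I'm using a separate list called `groceries` here so the `shopping_list` above stays unchanged.

```python
groceries = ["milk", "apples", "cereal", "juice"]

last_item = groceries[-1]     # negative index counts from the end
first_two = groceries[:2]     # slicing gives back a new list
groceries.append("bread")     # lists are mutable: this changes the list in place
groceries[0] = "oat milk"     # item assignment works too

print(last_item)    # juice
print(first_two)    # ['milk', 'apples']
print(groceries)    # ['oat milk', 'apples', 'cereal', 'juice', 'bread']
```

Indexing and slicing work just like they do for strings. The big difference is that a list can be changed after it is created.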

### How does looping work?

In [31]:
new_shopping_list = ["ham", "milk"]
print(new_shopping_list)

for item in shopping_list:
    if item not in new_shopping_list:
        new_shopping_list.append(item)
        
print(new_shopping_list)
        

['ham', 'milk']
['ham', 'milk', 'apples', 'cereal', 'juice']
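
Sometimes you want the position of each item as well as the item itself. The built-in `enumerate()` gives you both on every pass through the loop. A minimal sketch, with the list redefined so the block runs on its own:

```python
shopping_list = ["milk", "apples", "cereal", "juice"]

numbered_items = []
# enumerate yields (position, item) pairs; start=1 makes the count begin at 1
for position, item in enumerate(shopping_list, start=1):
    numbered_items.append(f"{position}. {item}")

print("\n".join(numbered_items))
```

This prints a numbered list, from `1. milk` down to `4. juice`.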


What is the datatype of `pastfive_sb_winners`?

Find it, then assign it to the variable `pastfive_sb_winners_type` (hint: use `type()`). 

In [13]:
pastfive_sb_winners = {
    "2023": "Kansas City Chiefs",
    "2022": "Los Angeles Rams",
    "2021": "Tampa Bay Buccaneers",
    "2020": "Kansas City Chiefs",
    "2019": "New England Patriots",
    "2018": "Philadelphia Eagles",
    "2017": "New England Patriots",
    "2016": "Denver Broncos"    
}

### Your code here
pastfive_sb_winners_type = ... # change this

## Throws an error if pastfive_sb_winners_type is incorrect

assert pastfive_sb_winners_type == Solutions.q1, "Try again"

In [58]:
#years_list = ['2023', '2022', '2021', '2020', '2019', '2018', '2017', '2016']


---

### Python dictionaries represent key-value pairs.

* In `pastfive_sb_winners`, the **keys** are years and the **values** are team names.

* Each year maps to the team that won the Super Bowl that year.

**Note**: Dictionary keys must be unique, and they must be immutable (hashable) values such as strings or numbers. 

* We'll go more into this later.

* For now, the most important thing to know is that you can extract a value from a `dict` using its key.

* We'll be doing this to build our Band Name Generator!
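
For the curious, here is a quick sketch of what those key rules look like in practice. The `prices` dictionaries are made up for the example.

```python
# Writing the same key twice is allowed, but only the last value is kept
prices = {"milk": 3.50, "milk": 4.25}
print(prices)   # {'milk': 4.25}

# A list can be changed after creation, so Python refuses to use it as a key
try:
    invalid = {["milk", "oat"]: 3.50}
    list_key_allowed = True
except TypeError:
    list_key_allowed = False
print(list_key_allowed)   # False

# A tuple cannot be changed, so it works fine as a key
tuple_key_prices = {("milk", "oat"): 4.75}
print(tuple_key_prices[("milk", "oat")])   # 4.75
```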



In [15]:
pastfive_sb_winners["2023"]

# If the key doesn't exist???
# pastfive_sb_winners["2025"]


'Kansas City Chiefs'
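
What about a key that isn't there? Here is a small sketch on a made-up dictionary called `stock`, showing three ways to deal with a missing key.

```python
stock = {"milk": 2, "apples": 6}   # made-up example data

# 1. Check first: `in` looks at the keys of a dictionary
has_juice = "juice" in stock

# 2. Ask politely: .get() gives back a default instead of raising an error
juice_count = stock.get("juice", 0)

# 3. Ask directly and catch the error: square brackets raise KeyError
try:
    stock["juice"]
    lookup_failed = False
except KeyError:
    lookup_failed = True

print(has_juice)       # False
print(juice_count)     # 0
print(lookup_failed)   # True
```

Which one to use depends on the situation: `.get()` is convenient when a sensible default exists, while letting the `KeyError` happen is better when a missing key means something has gone wrong.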

Who won the Super Bowl in 2017? Assign the answer to `sb_winner_2017` and run the cell to check.

In [18]:
## Your code here
sb_winner_2017 = ...

assert sb_winner_2017 == Solutions.q2, "Try again"

---

### Your turn!

We now have enough information to write a piece of code that:

* Takes an input `year` (Question: are there any constraints to consider?)

* Returns the value for `superbowl_winner` of that `year` using `pastfive_sb_winners`



In other words, we're mapping a parameter *year* (what goes in) to a value *superbowl_winner* (what comes out).

* That's pretty much the gist of what a function does!


In [20]:
# Our dictionary
SB_DICT = pastfive_sb_winners
print(SB_DICT)

{'2023': 'Kansas City Chiefs', '2022': 'Los Angeles Rams', '2021': 'Tampa Bay Buccaneers', '2020': 'Kansas City Chiefs', '2019': 'New England Patriots', '2018': 'Philadelphia Eagles', '2017': 'New England Patriots', '2016': 'Denver Broncos'}


In [None]:

def who_won_the_sb(year:...) -> str:    # what is the expected dtype of year?
    ### Your code here
    pass
    
