# Lab 1: Introduction

In this lab, we will focus on some introductory aspects of Python and this computing environment. 

Python is a good first language because (in my opinion, of course) it has a very approachable syntax. Once you become familiar with the syntax, Python can begin to read like normal prose, and you will become comfortable writing code. It is also a practical language to start with because it is widely used across technology teams, spanning engineering, data analysis, and data science workflows.

In this tutorial, we are using a Jupyter Notebook. Fun fact: the name Jupyter is a reference to the three core languages it first supported (*Ju*lia, *Pyt*hon, and *R*). A Notebook consists of cells, which can be executed by pressing `shift + enter` or by clicking the execute button on the top right of a cell.

In [2]:
# This is a comment which starts with the pound or hashtag symbol
# execute this cell by typing shift + enter

# create a variable a and assign the value 1 to it
a = 1
a

1

In [3]:
# basic arithmetic: 
a = 1
b = 2
c = 3

ans = a * b * c
print(ans)

6


# Python Objects

## Lists

In [4]:
# lists are ordered collections of objects, such as text or numbers
# they are declared with square brackets, and items are separated by commas
a = [1, 2, 3]
b = ['Hello', 'World']
c = ['Magic Number', 42]

# to access an item, enter its index. Note that Python starts counting at 0
c[1]

42
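
As a minimal sketch of how list indexing behaves (the list `letters` is made up for illustration and is not part of the lab data), the cell below shows zero-based indexing, negative indices, slicing, and appending:

```python
# hypothetical example list, not part of the lab data
letters = ['a', 'b', 'c', 'd']

first = letters[0]       # index 0 is the first item
last = letters[-1]       # negative indices count from the end
middle = letters[1:3]    # slices include the start and exclude the stop

letters.append('e')      # lists are mutable, so they can grow in place

print(first, last, middle, len(letters))
```

Note that the slice `1:3` returns only two items, because the stop position is not included.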

## Dictionaries

In [5]:
# dictionaries (or hash tables) contain a mapping of one data element to another.
# for example: someone's weight could be 165 pounds, and their height is 6 feet.
# we separate the key, which is the descriptor (e.g. 'Weight'), from the value (e.g. 165)
# using the colon symbol `:`
data = {'Weight': 165, 'Height': 6}

# retrieve a data point using its key
data['Weight']

165
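
Beyond looking up a value, dictionaries can be updated and iterated over. The sketch below uses a hypothetical `person` dictionary (the `Age` value and the `Shoe` key are invented for illustration) to show a few common operations:

```python
# hypothetical measurements, similar in shape to the `data` dictionary
person = {'Weight': 165, 'Height': 6}

person['Age'] = 30       # assigning to a new key adds an entry
person['Weight'] = 170   # assigning to an existing key overwrites its value

# .get returns a default (here None) instead of raising KeyError for a missing key
shoe_size = person.get('Shoe', None)

# iterate over key/value pairs
for key, value in person.items():
    print(key, value)
```

Using `.get` is one way to avoid an error when you are not sure a key exists.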

## Functions

Functions are useful because they help organize reusable code segments. This follows a popular design principle, Don't Repeat Yourself (DRY). If we define code once, we can update it in one location rather than in many cells.

A function is declared using `def`. We name the function and list any input arguments in parentheses. Finally, we type a colon to mark the end of the function header, and the indented lines that follow form the function body. At the end of the function, we can `return` data back to the caller if we wish.

In [6]:
def add(a, b):
    c = a + b
    return c

add(1, 2)

3
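
To make the DRY idea a little more concrete, here is a small sketch with a hypothetical helper, `pounds_to_kg`, which is not part of the lab. The conversion is written once and reused for every value:

```python
def pounds_to_kg(pounds):
    """Convert a weight in pounds to kilograms."""
    # 1 pound is defined as exactly 0.45359237 kilograms
    return pounds * 0.45359237

# made-up weights in pounds
weights_lb = [165, 150, 200]

# the conversion lives in one place, so every call stays consistent
weights_kg = []
for w in weights_lb:
    weights_kg.append(round(pounds_to_kg(w), 1))

print(weights_kg)
```

If the conversion ever needed to change (say, to round differently), only the function body would need editing.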

To improve this function definition, we can add type hints (the annotation syntax that static checkers such as mypy read), which tell the user what types of inputs are expected and what type of output is returned. In editors and notebook environments that support it, hovering over the function shows these hints.

In [7]:
def multiply(a: float = 1, b: float = 1) -> float:
    c = a * b
    return c

multiply(a=10, b=2)

20
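
It may help to know that these hints are documentation for people and tools, and that Python itself does not check them when the function runs. The sketch below repeats the `multiply` definition so the cell is self-contained, then shows where the hints are stored and what happens when they are ignored:

```python
def multiply(a: float = 1, b: float = 1) -> float:
    c = a * b
    return c

# the hints are stored on the function object
print(multiply.__annotations__)

# default values are used when an argument is omitted (b falls back to 1)
print(multiply(a=10))

# the hints are not enforced at runtime: a str is accepted and repeated
print(multiply(a='ab', b=2))
```

Catching a mismatch like the last call requires running a separate static checker over the code.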

In general, we will use software created by other teams. These are called packages and must be imported into the notebook. Some of the most popular packages include `numpy` for matrix manipulation, `pandas` for data storage and manipulation, and `matplotlib` for plotting. 

We can optionally give an alias, or nickname, to each package. This helps us refer to a long package name with a shortcut. While these are arbitrary, the common aliases are given below:

In [8]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
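
An alias is only a second name for the same module. As a small sketch, the cell below uses the standard-library `math` module (chosen here so the example runs without extra installs; the alias `m` is arbitrary):

```python
import math
import math as m   # `m` is an arbitrary alias chosen for this example

# both names refer to the same module object
print(m is math)

# so functions can be called through either name
print(m.sqrt(16), math.sqrt(16))
```

The same holds for `np`, `pd`, and `plt`: they are conventions rather than requirements.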

## Pandas

One popular library is `pandas`, which is commonly used for data manipulation. We will show some of the most common techniques you will use in pandas. Of course, feel free to refer to the [official documentation](https://pandas.pydata.org/) to learn more.

Let's grab some data we will analyze in Lab 2 and show the basics!

We load the data into a pandas DataFrame.

In [9]:
# read dataset from local file
df = pd.read_csv('../datasets/non-voters/nonvoters_data.csv')

In [10]:
# show the first 10 rows
df.head(10)

Unnamed: 0,RespId,weight,Q1,Q2_1,Q2_2,Q2_3,Q2_4,Q2_5,Q2_6,Q2_7,...,Q30,Q31,Q32,Q33,ppage,educ,race,gender,income_cat,voter_category
0,470001,0.7516,1,1,1,2,4,1,4,2,...,2,,1.0,,73,College,White,Female,$75-125k,always
1,470002,1.0267,1,1,2,2,3,1,1,2,...,3,,,1.0,90,College,White,Female,$125k or more,always
2,470003,1.0844,1,1,1,2,2,1,1,2,...,2,,2.0,,53,College,White,Male,$125k or more,sporadic
3,470007,0.6817,1,1,1,1,3,1,1,1,...,2,,1.0,,58,Some college,Black,Female,$40-75k,sporadic
4,480008,0.991,1,1,1,-1,1,1,1,1,...,1,-1.0,,,81,High school or less,White,Male,$40-75k,always
5,480009,1.0591,1,3,2,3,4,1,3,3,...,5,,,-1.0,61,High school or less,White,Female,$40-75k,rarely/never
6,480010,1.1512,1,1,1,2,3,1,1,1,...,1,1.0,,,80,High school or less,White,Female,$125k or more,always
7,470008,1.0174,1,1,1,2,2,1,3,1,...,2,,1.0,,68,Some college,Other/Mixed,Female,$75-125k,always
8,470010,0.8184,1,1,1,1,3,1,1,1,...,1,1.0,,,70,College,White,Male,$125k or more,always
9,470011,1.1653,1,1,1,2,1,1,1,1,...,3,,,1.0,83,Some college,White,Male,$125k or more,always


In [11]:
# show the last 10 rows
df.tail(10)

Unnamed: 0,RespId,weight,Q1,Q2_1,Q2_2,Q2_3,Q2_4,Q2_5,Q2_6,Q2_7,...,Q30,Q31,Q32,Q33,ppage,educ,race,gender,income_cat,voter_category
5826,477654,0.7655,1,1,1,1,2,1,1,1,...,3,,,2.0,47,High school or less,Hispanic,Male,$40-75k,sporadic
5827,477657,1.56,1,1,1,1,1,1,1,1,...,1,1.0,,,45,Some college,White,Female,$125k or more,sporadic
5828,477658,1.3218,1,1,1,2,4,2,4,2,...,4,,,1.0,37,Some college,White,Female,$40-75k,sporadic
5829,477660,1.2671,1,2,1,2,1,1,2,1,...,2,,2.0,,33,High school or less,Black,Female,Less than $40k,rarely/never
5830,477661,1.2518,1,1,2,1,2,2,1,1,...,5,,,2.0,26,High school or less,White,Female,$75-125k,rarely/never
5831,477662,1.1916,1,1,3,1,3,1,2,2,...,2,,1.0,,27,Some college,Hispanic,Male,$40-75k,always
5832,477663,1.4623,1,1,1,1,2,1,2,1,...,2,,2.0,,59,High school or less,White,Female,$125k or more,rarely/never
5833,488322,0.9252,1,1,2,1,3,1,1,2,...,2,,1.0,,51,College,Other/Mixed,Male,$125k or more,sporadic
5834,488325,2.6311,1,2,2,2,2,2,2,2,...,3,,,1.0,22,High school or less,Black,Female,Less than $40k,always
5835,477666,1.6218,1,1,3,2,3,1,1,2,...,5,,,2.0,22,High school or less,Black,Female,Less than $40k,always


In [12]:
# let's inspect this DataFrame - this lets us quickly see what values we can expect in it
df.describe()

Unnamed: 0,RespId,weight,Q1,Q2_1,Q2_2,Q2_3,Q2_4,Q2_5,Q2_6,Q2_7,...,Q29_6,Q29_7,Q29_8,Q29_9,Q29_10,Q30,Q31,Q32,Q33,ppage
count,5836.0,5836.0,5836.0,5836.0,5836.0,5836.0,5836.0,5836.0,5836.0,5836.0,...,1342.0,1342.0,1342.0,1342.0,1342.0,5836.0,1592.0,2002.0,2242.0,5836.0
mean,474653.997772,0.991023,1.0,1.246402,1.705106,1.63828,2.175977,1.277245,1.805517,1.491604,...,-0.926975,-0.758569,-0.697466,-0.81073,-0.700447,2.325051,1.36495,1.365634,1.220339,51.693797
std,3628.475677,0.345022,0.0,0.660253,0.866346,0.765741,1.091391,0.626386,1.011524,0.80812,...,0.375264,0.651835,0.716885,0.585638,0.71397,1.259642,0.519249,0.497046,0.958569,17.071561
min,470001.0,0.2298,1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,...,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,22.0
25%,472069.75,0.79315,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,-1.0,-1.0,-1.0,-1.0,-1.0,1.0,1.0,1.0,1.0,36.0
50%,474152.0,0.9676,1.0,1.0,2.0,2.0,2.0,1.0,1.0,1.0,...,-1.0,-1.0,-1.0,-1.0,-1.0,2.0,1.0,1.0,1.0,54.0
75%,476217.5,1.1696,1.0,1.0,2.0,2.0,3.0,1.0,2.0,2.0,...,-1.0,-1.0,-1.0,-1.0,-1.0,3.0,2.0,2.0,2.0,65.0
max,488325.0,3.0386,1.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,...,1.0,1.0,1.0,1.0,1.0,5.0,2.0,2.0,2.0,94.0
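
One detail worth noticing in this output is that the `count` row differs between columns (for example, 5836 for `Q30` but 1592 for `Q31`). `describe()` counts only non-missing values. The sketch below illustrates that behavior on a small made-up DataFrame; the name `toy` and the columns `q_a` and `q_b` are hypothetical and not part of the survey data:

```python
import numpy as np
import pandas as pd

# small made-up table; `q_b` has two missing answers
toy = pd.DataFrame({
    'q_a': [1, 2, 3, 4],
    'q_b': [1.0, np.nan, 2.0, np.nan],
})

summary = toy.describe()

# `count` excludes missing values, so the two columns differ
print(summary.loc['count'])

# the number of missing values per column
print(toy.isna().sum())
```

Comparing `count` against the number of rows is a quick way to spot columns with missing data.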


### Filtering and Slicing

It is very common to want a specific portion of data from the DataFrame. Perhaps we want to see a particular column, or maybe we want to filter the table in some way.

In [13]:
# grab a column of the DataFrame using the column name
df['gender']

0       Female
1       Female
2         Male
3       Female
4         Male
         ...  
5831      Male
5832    Female
5833      Male
5834    Female
5835    Female
Name: gender, Length: 5836, dtype: object

In [14]:
# grab a specific row or column by position
# the `.iloc` indexer selects data by integer position, given a pair of indices.
# First, you enter the row position, then the column position. The `:` symbol on its own
# selects everything along that axis, for example all rows
df.iloc[:, 3]

0       1
1       1
2       1
3       1
4       1
       ..
5831    1
5832    1
5833    1
5834    2
5835    1
Name: Q2_1, Length: 5836, dtype: int64

In [15]:
# grab a range of rows and columns
# `0:10` selects rows 0 through 9 (the stop value is excluded), and `:3` selects the first three columns
df.iloc[0:10, :3]

Unnamed: 0,RespId,weight,Q1
0,470001,0.7516,1
1,470002,1.0267,1
2,470003,1.0844,1
3,470007,0.6817,1
4,480008,0.991,1
5,480009,1.0591,1
6,480010,1.1512,1
7,470008,1.0174,1
8,470010,0.8184,1
9,470011,1.1653,1
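
The slicing rules used with `.iloc` can be checked on a small made-up DataFrame. In this sketch, the names `toy`, `x`, `y`, and `z` are hypothetical and stand in for the survey data:

```python
import pandas as pd

# small made-up table with four rows and three columns
toy = pd.DataFrame({
    'x': [10, 20, 30, 40],
    'y': ['a', 'b', 'c', 'd'],
    'z': [1.5, 2.5, 3.5, 4.5],
})

# rows at positions 0 and 1 (the stop position 2 is excluded), first two columns
top_left = toy.iloc[0:2, :2]
print(top_left.shape)

# a single integer for the column returns a Series rather than a DataFrame
col = toy.iloc[:, 1]
print(type(col).__name__, list(col))
```

Because `.iloc` works by position, it gives the same result regardless of what the row labels or column names are.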


In [16]:
# negative positions count from the end, so we can select the last three columns like this
df.iloc[:, -3:]

Unnamed: 0,gender,income_cat,voter_category
0,Female,$75-125k,always
1,Female,$125k or more,always
2,Male,$125k or more,sporadic
3,Female,$40-75k,sporadic
4,Male,$40-75k,always
...,...,...,...
5831,Male,$40-75k,always
5832,Female,$125k or more,rarely/never
5833,Male,$125k or more,sporadic
5834,Female,Less than $40k,always
