# Python Coding Workshop

4/15/21

## Overview

Graduate Student Instructor: Kayleigh Barnes

Email: kayleighnb@berkeley.edu

### Goals for today
This session guides you through the practical implementation of basic analytic techniques in Python using Jupyter notebooks. Python is an open-source, general-purpose programming language that is widely used to analyze data (among many, many other things). A Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations, and narrative text. The workshop focuses on interactive demonstration in Python, but also includes time for questions and guidance in working through the sample code. We will cover some fundamental coding techniques that will help you in Econ 140, basic data science classes, or research assistant positions. This workshop is for *beginners* who have little or no coding experience.



### Important notes 
- One attendee from today's workshop will be randomly selected to win a 20 dollar gift card to Amazon 
- Attending this workshop comes with free access to DataCamp through July. DataCamp offers online courses in both R and Python so that you can continue learning after today's workshop
- Link to join Berkeley Econ's DataCamp group with your @berkeley.edu ID: [here](https://www.datacamp.com/groups/shared_links/9cecd27b5daab26dc69f7d4a48b3c2ae5e20ff9ed77e3e239fa2e4510a4848d3) (make sure you're signed out of DataCamp before clicking this; otherwise the sign-up breaks and you'll be asked to pay after the first chapter of any course)

## Jupyter and Python Basics
- To create a new notebook, click the "New" button and select Python 3
- Write Python code in a cell by selecting the option "Code" from the dropdown list, or write text by selecting "Markdown"
- Select "Insert" to add a new cell for text or code
- Run code by highlighting the cell and selecting "Run"
- Use the # symbol to add comments in code cells, or to create headings in Markdown cells
- To clear your coding output, select Cell=>All Output=>Clear 

User-written open-source libraries provide specific functionality in Python (e.g., nice graphics, data analysis). Normally a library must be installed once and then loaded at the beginning of every script. The libraries we need have been pre-installed in this Jupyter environment, so here we only need to load them. If you are wondering why a command you've used before is no longer working, it may be because you haven't loaded the library.

In [None]:
import numpy as np  # Numeric Python: lets us sort and index data, among many other things (scipy and pandas build on it)
import scipy as sp  # Scientific Python: builds on numpy with additional linear algebra capabilities
from scipy import stats  # statistical tests, used later for the t-test
import pandas as pd  # the library for loading and working with data; very important!
import os  # operating system interface; we will use it to manage the working directory
import matplotlib.pyplot as plt  # for making nice graphs
# 'seaborn-whitegrid' was renamed 'seaborn-v0_8-whitegrid' in matplotlib 3.6, so use whichever name is available
plt.style.use('seaborn-v0_8-whitegrid' if 'seaborn-v0_8-whitegrid' in plt.style.available else 'seaborn-whitegrid')

In [None]:
# The help function: calling help() on a command brings up information on what the command does
help(print)

In [None]:
# The working directory is the location where Python will look for data
# this is the same as telling your computer to look in a documents folder when uploading something
os.getcwd()
#os.chdir('/home/jovyan/my-work') #remove the first # from this line to run code that changes the working directory
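
A common stumbling block is a file path that does not match the working directory. The short sketch below assumes the folder layout used later in this notebook (a `Data` folder containing `MyFirstData.csv`). It builds the path and checks whether the file exists before you try to read it.

```python
import os

# Folder and file name follow the layout used later in this notebook;
# change them if your files live somewhere else.
data_path = os.path.join('Data', 'MyFirstData.csv')

print('Working directory:', os.getcwd())
if os.path.exists(data_path):
    print('Found', data_path)
else:
    print('Could not find', data_path, '- check the working directory or upload the file')
```

If the file is not found, either change the working directory with `os.chdir()` or upload the data to the expected folder.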


## Loading in data and summary statistics

Now let's load in the data set. Make sure you have uploaded the data to Jupyter before running the next line of code. We are going to use data on a set of households in Mexico in the 1990s. The data includes a village ID, a household ID, and demographic variables like income, household size, age and gender of the head of household, and a poverty indicator.

In [None]:
MyFirstData = pd.read_csv('Data/MyFirstData.csv') 

Notice that there is no output from the code that reads in the data. Unlike Excel, Python stores the data in the background and we need to use specific commands to interact with it. Once it's read in, we can use several commands to describe the data.

In [None]:
# Information about the structure of the data
MyFirstData.info()

In [None]:
# summary statistics for the data 
MyFirstData.describe(include='all')

In [None]:
# print the names of the columns of the data
MyFirstData.columns

In [None]:
# number of rows and number of columns
MyFirstData.shape

In [None]:
# first 6 rows of the data (change the number to see more or fewer rows)
MyFirstData.head(6)

In [None]:
# display values and counts of categorical data 
MyFirstData['sexhead'].value_counts()

## Basic Data Cleaning and Formatting

### Category Variable

Right now, we have two categorical variables: sexhead, which indicates the sex of the head of household and pov_HH, which indicates whether a household is below the poverty line. The data entries for these variables are text rather than numbers (we call these string variables in the data science world). Often when doing data analysis, it is easier to map categorical text variables to numbers, particularly 0 and 1. These variables that contain only 0's and 1's are called dummy variables. 

Now, suppose we want to create a poor_male variable, which will be defined as 1 if the household is categorized as poor (pov_HH = pobre) and the head of the household is male (sexhead is Male), and 0 otherwise.

In [None]:
# first, let's create dummy variables out of sexhead and pov_HH using the map method
MyFirstData['sexhead_male'] = MyFirstData['sexhead'].map({'Male':1, 'Female':0})
MyFirstData['pov_HH_pobre'] = MyFirstData['pov_HH'].map({'pobre':1, 'no pobre':0})

# compare this output to the output above to make sure it worked correctly
MyFirstData['sexhead_male'].value_counts()

In [None]:
MyFirstData['poor_male']=MyFirstData['pov_HH_pobre']*MyFirstData['sexhead_male']
MyFirstData['poor_male'].value_counts()
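
One thing to watch for in the mapping step above: any value that is not listed in the dictionary becomes missing (`NaN`). The sketch below uses a small made-up table (hypothetical values, not the workshop data) to show how to count unmapped entries and how a missing value carries through to the product.

```python
import pandas as pd

# Hypothetical toy data, not the workshop dataset
toy = pd.DataFrame({
    'sexhead': ['Male', 'Female', 'Male', 'male'],  # last entry is spelled differently
    'pov_HH': ['pobre', 'pobre', 'no pobre', 'pobre'],
})

toy['sexhead_male'] = toy['sexhead'].map({'Male': 1, 'Female': 0})
toy['pov_HH_pobre'] = toy['pov_HH'].map({'pobre': 1, 'no pobre': 0})

# Values missing from the dictionary become NaN, so count them before moving on
n_unmapped = toy['sexhead_male'].isna().sum()
print('Unmapped sexhead values:', n_unmapped)

# The product is 1 only when both dummies are 1; NaN carries through
toy['poor_male'] = toy['pov_HH_pobre'] * toy['sexhead_male']
print(toy)
```

Comparing `value_counts()` before and after the mapping, as done above, is a quick way to catch the same problem in the real data.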

### Numerical Variable
We can use regular mathematical operations to create numerical variables from other variables.

In [None]:
MyFirstData['agehead2'] = MyFirstData['agehead']**2
MyFirstData['agehead2'].describe()

In [None]:
MyFirstData['constant'] = 1
MyFirstData['constant'].describe()

### New Datasets
We may also want to create a new dataset that summarizes the original, or is a subset of it.

In [None]:
#Subset of only observations with male head of hh
data_males=MyFirstData.loc[MyFirstData['sexhead_male']==1]
data_males.describe(include='all')

In [None]:
meandata = MyFirstData.groupby('villid').agg({'IncomeLab':['mean'],
                            'famsize':['mean'],
                            'agehead':['mean']}).reset_index()
meandata.columns = ['villid', 'meanIncomeLab', 'meanfamsize', 'meanagehead']
meandata.describe(include='all')
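
As an alternative to renaming the columns by hand, pandas also supports "named aggregation", where each new column name is given directly inside `agg`. The sketch below uses a small made-up table (hypothetical values) with the same column names as the workshop data.

```python
import pandas as pd

# Hypothetical toy data using the same column names as the workshop dataset
toy = pd.DataFrame({
    'villid': [1, 1, 2, 2, 2],
    'IncomeLab': [100.0, 300.0, 50.0, 150.0, 250.0],
    'famsize': [4, 6, 3, 5, 7],
})

# Named aggregation: new_column=(source_column, function)
village_means = toy.groupby('villid').agg(
    meanIncomeLab=('IncomeLab', 'mean'),
    meanfamsize=('famsize', 'mean'),
).reset_index()

print(village_means)
```

Both approaches give one row per village; this version simply avoids the separate step of assigning `columns`.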

## Making comparisons - T-Tests

A main goal of working with data is to make inferences about the population we are interested in. Much of Econ 140 will be focused on methods to make these inferences: What is the relationship between two variables? Did an experiment have a significant treatment effect?

If you have taken Stats 20, you are likely already familiar with a t-test. A t-test compares the means of a variable between two groups. The test statistic tells us whether the difference is *statistically significant*, that is, whether it is large enough relative to the variation in the data that we can confidently say the two groups are different.

In [None]:
# numeric_only=True skips text columns such as sexhead, which cannot be averaged
MyFirstData.groupby('pov_HH').mean(numeric_only=True)

In [None]:
cat1 = MyFirstData[MyFirstData['pov_HH']=='pobre']
cat2 = MyFirstData[MyFirstData['pov_HH']=='no pobre']

stats.ttest_ind(cat1['famsize'], cat2['famsize'])
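
To see what the test statistic measures, the sketch below computes it by hand for two small made-up groups (hypothetical numbers, not the workshop data) and compares it with the value from `stats.ttest_ind`. It assumes the default equal-variance version of the test, which pools the variance of the two groups.

```python
import numpy as np
from scipy import stats

# Hypothetical family sizes for two groups, not the workshop data
group_a = np.array([5, 6, 7, 5, 8, 6], dtype=float)
group_b = np.array([4, 5, 3, 4, 6, 4], dtype=float)

n_a, n_b = len(group_a), len(group_b)

# Pooled variance, matching the equal-variance default of stats.ttest_ind
pooled_var = ((n_a - 1) * group_a.var(ddof=1) + (n_b - 1) * group_b.var(ddof=1)) / (n_a + n_b - 2)

# Difference in means divided by its standard error
t_by_hand = (group_a.mean() - group_b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

result = stats.ttest_ind(group_a, group_b)
print('t by hand:', t_by_hand)
print('t from scipy:', result.statistic, ' p-value:', result.pvalue)
```

The two t values should agree. If the groups have clearly different variances, passing `equal_var=False` to `stats.ttest_ind` runs Welch's t-test instead.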

## Visualizing Data
We will use the matplotlib library to make some graphs.

In [None]:
plt.figure(figsize=(9,7))
plt.scatter(MyFirstData['agehead'], MyFirstData['famsize'])
plt.xlabel('Age of Household Head')
plt.ylabel('Family Size')

In [None]:
plt.figure(figsize=(9,7))
plt.hist(MyFirstData['famsize'], density=True, bins=30)
plt.xlabel('Family Size')
plt.title('Histogram of Family Size')
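
The `density=True` option rescales the bars so that their total area equals 1, rather than showing raw counts. The sketch below checks this on a small made-up sample (hypothetical values) using `np.histogram`, which applies the same normalization.

```python
import numpy as np

# Hypothetical family sizes, not the workshop data
famsize = np.array([2, 3, 3, 4, 4, 4, 5, 5, 6, 8])

# With density=True, bar heights are scaled so that height * width sums to 1
heights, edges = np.histogram(famsize, bins=3, density=True)
widths = np.diff(edges)

print('Bar heights:', heights)
print('Total area:', (heights * widths).sum())
```

Drop `density=True` if you would rather see the number of households in each bin.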