# Introduction to Python and Jupyter Notebooks


### Good programming practices... 

Remember our 3 good programming practices: **commenting**, **consistency**, and **abstraction**. 

Let's look at how to do these in Python... 

In [None]:
# To comment one line in Python, you can use a # 
# See, each # will comment out a line. 

''' 
You can also block out multiple lines 
with a set of three apostrophes. 
Strictly speaking this is an unused string, not a true comment. 
''' 

# Notice that comments appear in a different color in VS Code. 
# This makes it easier to tell what will and won't get executed by 
# the interpreter when the cell is run. 
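
One place where triple-quoted text is the norm is a docstring: a string placed as the first statement of a function to describe what it does. The sketch below is only an illustration; the function name `describeFruit` is made up for this example and is not part of the assignment.

```python
# A hypothetical function used only to illustrate docstrings
def describeFruit(name):
    """Return a short sentence about the fruit called name."""
    # A regular comment: skipped entirely when the code runs
    return "This fruit is a " + name

# Unlike a comment, the docstring is stored on the function and can be read back
print(describeFruit.__doc__)
print(describeFruit("pear"))
```

Regular `#` comments explain individual lines, while a docstring documents the function as a whole.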

In [None]:
# Comments can also contain key terms that make them easier to scan. 

# One key term is the TODO marker 
# Python treats TODO like any other comment text, but many editors can highlight it in a different color. 

# To get full points for this assignment, make sure you complete all the TODO markers 

In [2]:
# For consistency, we want to make sure everything only has 
# one path forward to run. 

# Consistency can be upheld by using generic variables and abstracting repeated code out into functions 
# To declare a variable in Python, use the assignment operator (=). 

# declares a variable called apple that holds the string "apple" 
apple = "apple"
# declares a variable called math that holds the integer 7 
math = 3+4 

print(apple)
print(math)

apple
7
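
As a small illustration of why variables help with consistency, the sketch below stores each value once and reuses it, so a change only has to be made in one place. The names `fruit` and `count` are made up for this example.

```python
# Store each value once...
fruit = "apple"
count = 3 + 4

# ...then reuse the variables instead of retyping the values
print(fruit)
print(count)
print(count * 2)

# type() reports what kind of data a variable currently holds
print(type(fruit))
print(type(count))
```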


In [10]:
# To build functions in Python, we use the def keyword

def mathFunction():
    print('I can do math')
    print('I can too')

mathFunction()

I can do math
I can too


In [None]:
# Functions in Python have several required components... 
    # The def keyword at the beginning of the function 
    # The parentheses next to the function name 
    # A colon after the closing parenthesis
    # Indented statements to execute within the function 

# For example, these won't work 

def brokenFunction: 
    print('I won\'t work')

def brokenFunction2(): 
print('I won\'t work')

def brokenFunction3()
    print('I won\'t work')

# Each of these definitions will raise a syntax error 
# TODO: fix all of the above definitions so no error is raised when the cell is run 

In [12]:
# Functions can also have parameters 
# To add parameters, place variable names in the parentheses, like this 

def functionWithParameter(a, b): 
    print(a + b)

functionWithParameter('Python is dynamically typed', ' so you can pass in strings or integers')
print("Just like this")
functionWithParameter(4, 5)

Python is dynamically typed so you can pass in strings or integers
Just like this
9
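
Dynamic typing does not mean every combination of types works. As a hedged sketch, the cell below defines a similar two-parameter function (named `addValues` just for this example, so the cell runs on its own) and shows that mixing a string with an integer raises a `TypeError`. Converting with `str()` first avoids the error.

```python
def addValues(a, b):
    # Works whenever Python knows how to apply + to a and b
    return a + b

try:
    addValues('Total: ', 3)      # str + int is not defined
    mixedWorked = True
except TypeError as error:
    mixedWorked = False
    print("TypeError:", error)

# Converting the integer to a string first makes the types match
print(addValues('Total: ', str(3)))
```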


In [14]:
# You can also use a return statement to send the computed value back to the caller.
# Jupyter displays the value of the last expression in a cell, which is why 9 appears below.

def functionWithParameterReturn(a, b): 
    return a + b

functionWithParameterReturn(4, 5)

9
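
The difference between printing and returning matters once you want to reuse a result. In this sketch, `addAndReturn` and `addAndPrint` are made-up names used only to contrast the two approaches.

```python
def addAndReturn(a, b):
    return a + b          # hands the value back to the caller

def addAndPrint(a, b):
    print(a + b)          # shows the value, but returns None

returned = addAndReturn(4, 5)
printed = addAndPrint(4, 5)

print(returned + 1)       # the returned value can be reused
print(printed)            # None, because nothing was returned
```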

In [None]:
# TODO: Create a variable called siblings, set it equal to the number of siblings you have 
# TODO: Create a variable called cousins, set it equal to the number of cousins you have 
# TODO: Create a function called family that will take two values and add them together, 
# have the function return or print out the value of the summed parameters
# TODO: Pass the siblings and cousins variables in to the family function

### Data Structures 

Python also uses different names than R for some data types and structures.

Use this chart to translate between the two. 

__Data Types__

| In R... | In Python... | 
|----|-----|
| Numeric | Integer (`int`) for whole numbers, float (`float`) for decimals | 
| Character | String | 

__Data Structures__

| In R... | In Python... | 
|----|-----|
| Vector | List, Array | 
| Factor | Enumeration (or Categorical in pandas) | 
| Dataframe | DataFrame (pandas) | 
| Matrix | 2D Array (less commonly matrix) | 
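
One practical difference behind this chart: arithmetic on a plain Python list is not element-wise the way it is on an R vector. The sketch below uses made-up values to show what `+` and `*` do to a list, and one way (a list comprehension) to get element-wise behavior without extra packages.

```python
prices = [1, 2, 3]

# + joins two lists together instead of adding element by element
print(prices + [10, 20, 30])    # [1, 2, 3, 10, 20, 30]

# * repeats the list instead of multiplying each element
print(prices * 2)               # [1, 2, 3, 1, 2, 3]

# A list comprehension applies an operation to each element
doubled = [p * 2 for p in prices]
print(doubled)                  # [2, 4, 6]
```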

In [15]:
# To build strings, you can use single or double quotes 

string1 = 'I am a string'
string2 = "I am a string"
string3 = ''' I am a multi 
lined string 
It's nice to be able to have it wrap'''

# To build integers and floats, you can just type out the value; there is no need to declare a type 

int1 = 9 
int2 = 10 
float1 = 9.8
float2 = 111.1111

# Also, note... 
# int and float are built-in names for Python's number types, so you shouldn't use them as variable names 
# (doing so would hide the built-ins). They can be called to convert a value to that type, e.g. int("9") 
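
Because `int`, `float`, and `str` can be called like functions, they can convert a value from one type to another. A minimal sketch with made-up values:

```python
wholeNumber = int("9")        # string -> integer
decimal = float("9.8")        # string -> float
truncated = int(9.8)          # float -> integer (drops the decimal part)
asText = str(10)              # integer -> string

print(wholeNumber, type(wholeNumber))
print(decimal, type(decimal))
print(truncated, type(truncated))
print(asText, type(asText))
```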

In [16]:
# To build a list (Python's built-in array-like structure), use square brackets 
array1 = ["array of strings", "array of strings"]
array2 = ['multi type array', 9, 10.11]
array3 = [10.2, 30.4, 10.4]

# You can nest lists inside each other like this 
bigArray = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# Ending the cell with the variable name displays its value 
bigArray

[[1, 2, 3], [4, 5, 6], [7, 8, 9]]


### Accessing Data Structures 

In [21]:
# Accessing elements in Python is similar to R 
# Each element has its own index... except that in Python indexing starts at 0 

print(array2[0])
print(array2[1])
print(array2[2])

# you can also get the length of the array with the len function 
print("\nLength is:")
print(len(array2))

multi type array
9
10.11

Length is:
3
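
Python indexing has a few extras worth knowing. As a sketch on a made-up list called `letters`: negative indexes count from the end, and a slice `start:stop` returns the elements from `start` up to but not including `stop`.

```python
letters = ['a', 'b', 'c', 'd', 'e']

print(letters[0])      # first element: a
print(letters[-1])     # last element: e
print(letters[1:3])    # elements at index 1 and 2: ['b', 'c']
print(letters[:2])     # first two elements: ['a', 'b']

# Asking for an index that does not exist raises an IndexError
try:
    letters[5]
except IndexError as error:
    print("IndexError:", error)
```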


In [29]:
# Accessing 2D arrays works the same way, just add another set of brackets 

print(bigArray[0])
print(bigArray[0][0])
print(bigArray[0][2])

[1, 2, 3]
1
3
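
A nested list has no built-in shape attribute, but `len` can be used to work out its dimensions. The sketch below uses a small made-up 2x3 list called `grid` and assumes every inner list has the same length.

```python
grid = [[1, 2, 3],
        [4, 5, 6]]

numRows = len(grid)        # number of inner lists
numCols = len(grid[0])     # number of elements in the first inner list

print(numRows, "rows x", numCols, "columns")
```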


In [None]:
# TODO, print out the 2nd element of the 3rd array in the bigArray defined above 

In [None]:
# TODO, create a 2D array with a bunch of different elements
# The array should be a 4x3 array, meaning it has 4 arrays inside of it and each array has 3 elements. 

### Getting pandas into our virtual environment 

To read csv files into Python, we can use a package called pandas. Fun name for a fun package! To install a package into our virtual environment, we can run the `%pip install` command in a notebook cell like this... 
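
Before installing, it can be handy to check whether a package is already available. The sketch below uses the standard library's `importlib.util.find_spec`, which returns `None` when a package cannot be found. The helper name `isInstalled` is made up for this example.

```python
import importlib.util

def isInstalled(packageName):
    # find_spec returns None when the package cannot be located
    return importlib.util.find_spec(packageName) is not None

print("pandas installed:", isInstalled("pandas"))
print("json installed:", isInstalled("json"))   # part of the standard library
```

If the first line prints `False`, run the install cell.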

In [33]:
%pip install pandas

Collecting pandas
  Downloading pandas-2.1.1-cp39-cp39-macosx_10_9_x86_64.whl.metadata (18 kB)
Collecting numpy>=1.22.4 (from pandas)
  Downloading numpy-1.26.1-cp39-cp39-macosx_10_9_x86_64.whl.metadata (61 kB)
Collecting pytz>=2020.1 (from pandas)
  Downloading pytz-2023.3.post1-py2.py3-none-any.whl.metadata (22 kB)
Collecting tzdata>=2022.1 (from pandas)
  Downloading tzdata-2023.3-py2.py3-none-any.whl (341 kB)
Downloading pandas-2.1.1-cp39-cp39-macosx_10_9_x86_64.whl (11.8 MB)
Downloading numpy-1.26.1-cp39-cp39-macosx_10_9_x86_64.whl (20.6 MB)

In [34]:
# Once it is installed in the virtual environment, we have to import it into this notebook 

import pandas as pd 

# To load a csv file with pandas, use the read_csv function 

df = pd.read_csv('~/Downloads/colleges.csv')
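
If you want to try `read_csv` without a file on disk, pandas can also read from a file-like object. The sketch below builds a tiny made-up CSV in memory with `io.StringIO`; the school names and numbers are invented for illustration and are not from colleges.csv.

```python
import io
import pandas as pd

# A tiny, invented CSV held in memory instead of in a file
csvText = """name,state,enrollment
Example College,MA,1200
Sample University,CA,15000
Demo Institute,TX,800
"""

smallDf = pd.read_csv(io.StringIO(csvText))

# head() shows the first rows (up to 5 by default)
print(smallDf.head())
```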

In [37]:
# To get summary statistics for the numeric columns in your dataframe 
df.describe()

,OPEID,median_debt,default_rate,admit_rate,SAT_avg,enrollment,net_price,avg_cost,net_tuition,ed_spending_per_student,avg_faculty_salary,pct_PELL,pct_fed_loan,grad_rate,pct_firstgen,med_fam_income,med_alum_earnings
count,4435.0,4435.0,4435.0,1704.0,1105.0,4435.0,4435.0,4435.0,4435.0,4435.0,3077.0,4435.0,4435.0,4435.0,4088.0,4399.0,3912.0
mean,1492464.0,11.19579,9.06009,70.812576,1139.842534,3110.519053,17.371474,27.10288,10.836639,7.760832,7.266518,45.55554,49.069461,54.945651,43.357756,31.79193,40.007157
std,1976276.0,5.319178,6.144554,20.567925,131.630792,6429.445325,8.638514,14.988075,7.50641,6.881391,2.528365,20.309775,24.542281,22.051351,12.931312,20.811117,14.486256
min,100200.0,1.932,0.0,2.44,760.0,0.0,-0.407,4.76,0.0,0.0,0.897,0.0,0.0,0.0,8.866995,0.0,10.939
25%,282200.0,6.863,4.4,59.7875,1050.0,171.0,10.849,16.4525,5.4395,4.126,5.61,29.83,30.925,37.31,35.006281,17.82775,29.72025
50%,766900.0,9.5,8.2,74.68,1113.0,868.0,16.757,22.945,9.912,6.352,6.958,42.5,52.54,56.4,45.102178,24.67,38.056
75%,2362002.0,15.0,12.3,86.115,1205.0,2953.0,22.4705,32.0325,14.218,9.342,8.573,60.38,67.68,71.915,52.599727,39.5165,47.38125
max,72098870.0,33.47,57.1,100.0,1566.0,109233.0,112.05,120.377,66.442,139.766,21.143,100.0,100.0,100.0,85.90604,179.864,132.969


In [38]:
# To get the column names of the dataframe 

df.columns

Index(['OPEID', 'name', 'city', 'state', 'region', 'median_debt',
       'default_rate', 'highest_degree', 'ownership', 'locale', 'hbcu',
       'admit_rate', 'SAT_avg', 'online_only', 'enrollment', 'net_price',
       'avg_cost', 'net_tuition', 'ed_spending_per_student',
       'avg_faculty_salary', 'pct_PELL', 'pct_fed_loan', 'grad_rate',
       'pct_firstgen', 'med_fam_income', 'med_alum_earnings'],
      dtype='object')

In [39]:
# To get the dimensions (rows, columns) of a dataframe 

df.shape

(4435, 26)
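
`shape` is a tuple of (rows, columns), so it can be unpacked into two variables. A small sketch on an invented dataframe (the column names and values are made up):

```python
import pandas as pd

exampleDf = pd.DataFrame({
    "school": ["A", "B", "C", "D"],
    "enrollment": [1200, 15000, 800, 4300],
})

numRows, numCols = exampleDf.shape
print("rows:", numRows)
print("columns:", numCols)

# len() on a dataframe gives the number of rows
print(len(exampleDf))
```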

In [43]:
# To get a particular column of a dataframe 

df['SAT_avg']

0        959.0
1       1245.0
2          NaN
3       1300.0
4        938.0
         ...  
4430       NaN
4431       NaN
4432       NaN
4433       NaN
4434       NaN
Name: SAT_avg, Length: 4435, dtype: float64

In [47]:
# You can then use math functions such as min and max on that column, much like in R 

print(min(df['SAT_avg']))
print(max(df['SAT_avg']))

760.0
1566.0
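
One caution: Python's built-in `min` and `max` can give surprising results when a column contains missing values (`NaN`), depending on where those values sit. The pandas methods `.min()` and `.max()` skip missing values by default. A sketch on an invented series:

```python
import pandas as pd

# Invented scores with a missing value in the first position
scores = pd.Series([float("nan"), 1245.0, 959.0, 1300.0])

# Built-in min starts from the first value; every comparison with NaN is False,
# so the NaN is never replaced
print(min(scores))        # nan

# The pandas methods ignore missing values by default
print(scores.min())       # 959.0
print(scores.max())       # 1300.0
```

For columns that may contain missing values, the pandas methods are the safer choice.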


## Homework... 

For your homework, the goal is to apply what you have learned to the dataset you picked for your project. 

In [None]:
# TODO Read your project dataset into a dataframe called df

In [None]:
# TODO Find out the dimensions of the project dataset 

In [None]:
# TODO Display only one column (any column) of the project dataset 

In [None]:
# TODO Choose one of the numerical columns of your project dataset... assign this column to a variable called colNumbers
# TODO Find the min, max, mean, and count of this column 

In [49]:
# TODO Look up the function value_counts... on one of your categorical variables,
#  use this function to display how many rows fall into each category 


In [None]:
# TODO Use the describe function on your dataset 

In [None]:
# TODO In your first project brainstorming assignment, you asked 3 questions you would ask of your dataset. 
# Place one of these questions in a comment within this notebook

# In EITHER code or writing, please show/describe how you could answer this question with this data. 
# For example, you could show the mean/max/min of two different columns in your dataset and then tell me how the comparison 
# between these two would help answer your question. This is super preliminary, so nothing complicated needs to be done here. 
# The goal is to get you thinking about how you need to interact with your dataset to start answering your questions. 

### To submit... 

Follow the naming convention mentioned on Moodle, execute all cells in this notebook, and then save the .ipynb file. Submit the .ipynb. 