# Introduction to Python and Jupyter Notebooks


### Commenting in Python 

Remember that "commenting" is an important practice for making your code more readable by:
- documenting what your code is doing, so that others can follow it and so that you can easily remember it later
- keeping code organized
- providing notes about the status of the code
- leaving comments for collaborators or reviewers

Let's look at how to comment in Python... 

In [18]:
# To comment one line in Python, you can use a #, just like in R

''' 
You can also comment out multiple lines 
with a set of three apostrophes 
''' 

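# Notice that any comments appear within VS Code in a different color; 
# this makes it easier to figure out whether something will get executed by 
# the interpreter when the cell is run. 
# Note that the triple-quoted block above is technically a string, not a comment, 
# which is why it is echoed back as this cell's output. 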

' \nYou can also comment out multiple lines \nwith a set of three apostrophes \n'
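
The echoed output above hints at a subtlety: a triple-quoted block is a string, not a true comment. The sketch below (the function name and text are made up for illustration) shows the difference, and also shows where triple-quoted strings are most commonly used in practice, namely as docstrings.

```python
def describe_comment_styles():
    """Return a short status word; this triple-quoted string is the docstring."""
    # A hash comment is ignored by the interpreter and leaves no value behind
    return "done"


# A bare triple-quoted string is an ordinary string value,
# so it can be stored in a variable like any other string
leftover = '''
I am a string, not a comment
'''

print(type(leftover))
print(describe_comment_styles.__doc__)
```

Because the docstring is attached to the function, tools such as `help()` can display it later, which a `#` comment cannot do.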

#### Markers
In Python, comments are used to explain code and make it more readable. "Markers" in the context of comments usually refer to specific annotations or conventions within comments that are used to highlight certain aspects or functionalities of the code. These markers can be simple text labels or more formalized tags recognized by certain tools or IDEs. For example, the VS Code extension "Todo Tree" can recognize markers and highlight them. 

Common Types of Comment Markers and Examples: 

In [None]:
# TODO: Indicates points in the code where work needs to be done. This is a conventional marker used by many developers and often recognized by IDEs to create a list of tasks.
# For example:
# TODO: Implement error handling

# FIXME: Highlights a problem that needs to be corrected.
# For example:
# FIXME: This section often causes a divide by zero error

# NOTE or INFO: Used to give additional information about the code, which is not immediately apparent.
# For example:
# NOTE: This function is deprecated in version 2.0

# HACK: Marks non-obvious or non-intuitive solutions to problems. Useful to indicate potential technical debt.
# For example:
# HACK: Temporary fix until the library updates



#### Best Practices for Using Comment Markers

- Consistency: Be consistent in how you use markers. If you start using TODO for tasks, stick with it throughout the project.
- Clarity: Make your comments clear and concise. The marker should be followed by a brief, but informative description.
- Context: Provide enough context in the comment so that someone else (or you in the future) can understand the issue or rationale without needing to read large sections of code.
- Prioritization: If possible, indicate the priority or severity of the issue next to the marker.
- Integration with Tools: Some IDEs and code quality tools can detect these markers and provide a summary or reminders about them. Leveraging such features can enhance productivity.
- Avoid Overuse: While markers can be very useful, overusing them can make your code cluttered and harder to read. Use them judiciously and only where they add value.
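
To make the "Integration with Tools" point concrete, here is a minimal sketch of how a tool might collect markers. It is not how any particular IDE or extension works; the sample source text, the `find_markers` function, and the regular expression are all assumptions made for illustration.

```python
import re

# Hypothetical source text to scan; in practice this could be read from a .py file
source = """\
def ratio(a, b):
    # TODO: Implement error handling
    # FIXME: This section often causes a divide by zero error
    return a / b  # NOTE: inputs are assumed to be numbers
"""

# Match a hash, an optional space, one of the marker words, then the message
MARKER_PATTERN = re.compile(r"#\s*(TODO|FIXME|NOTE|INFO|HACK)\b:?\s*(.*)")


def find_markers(text):
    """Return (line_number, marker, message) tuples for each marker comment."""
    found = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        match = MARKER_PATTERN.search(line)
        if match:
            found.append((line_number, match.group(1), match.group(2).strip()))
    return found


for line_number, marker, message in find_markers(source):
    print(f"line {line_number}: {marker} - {message}")
```

Using markers consistently is what makes this kind of simple search reliable.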


### Data Types and Data Structures in Python

The data types and structures are very similar to those in R, but some have different terminology...

#### Data Types

| In R... | In Python... | 
|----|-----|
| Numeric | If a whole number, then "integer." If a decimal, then "float." | 
| Character | String | 



In [2]:
# To build integers and floats, you can just type out the value, no need to declare 
# Note we use = instead of <-
# Also note, in Python, we cannot use "." in the object name as we can in R

my_int1 = 9 
my_int2 = 10 
my_float1 = 9.8
my_float2 = 111.1111

# To build strings, you can use single or double quotes 

my_string1 = 'I am a string'
my_string2 = "I am a string"
my_string3 = ''' I am a multi 
lined string 
It's nice to be able to have it wrap'''
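
Since Python does not require declaring a type, it can help to check what type a value ended up with. The short sketch below uses the built-in `type()` function, roughly comparable to `class()` in R; the variable names are chosen just for this example.

```python
whole_number = 9
decimal_number = 9.8
text = "I am a string"

print(type(whole_number))
print(type(decimal_number))
print(type(text))

# Mixing an int and a float in arithmetic produces a float
total = whole_number + decimal_number
print(type(total))

# Converting between types is explicit
print(int("10") + 1)   # the string "10" becomes the integer 10
print(str(9) + "8")    # the integer 9 becomes the string "9"
```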


#### Data Structures

| In R... | In Python... | 
|----|-----|
| Vector | List, Array | 
| Factor | Enumeration | 
| Dataframes | Dataframes | 
| Matrix | 2D Array (less commonly matrix) | 

In [4]:
# To build a list (Python's built-in, array-like structure), use square brackets 
# Compare this to c() in R
array1 = ["array of strings", "array of strings"]
array2 = ['multi type array', 9, 10.11]
array3 = [10.2, 30.4, 10.4]

# You can nest the arrays inside each other like this 
bigArray = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]


[[1, 2, 3], [4, 5, 6], [7, 8, 9]]
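
The table above lists "Enumeration" as the Python counterpart of an R factor, so here is a minimal sketch using the standard library's `enum` module. The `Ownership` class and its levels are hypothetical and chosen only for illustration.

```python
from enum import Enum


class Ownership(Enum):
    """A fixed set of allowed categories, loosely like the levels of an R factor."""
    PUBLIC = "public"
    PRIVATE_NONPROFIT = "private nonprofit"
    FOR_PROFIT = "for profit"


school_type = Ownership.PUBLIC
print(school_type.name)
print(school_type.value)

# List every allowed level, similar to levels() in R
print([level.name for level in Ownership])

# Look up a member from its value; an unknown value raises a ValueError
print(Ownership("for profit") is Ownership.FOR_PROFIT)
```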


### Accessing Data Structures 

In [27]:
# Accessing information is similar in Python 
# Each element has its own index... except in Python indexing starts at 0 

print(array2)
print(array2[0])
print(array2[1])
print(array2[2])

# You can also get the length of the array with the len() function 
print("\nLength is:")
print(len(array2))

# Accessing 2D (nested) arrays works the same way, just add more brackets 
print(bigArray[0])
print(bigArray[0][0])
print(bigArray[0][2])

['multi type array', 9, 10.11]
multi type array
9
10.11

Length is:
3
[1, 2, 3]
1
3
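
Beyond single positions, Python lists also support negative indices and slices. The sketch below uses its own small lists (the names `values` and `grid` are just for this example) to show a few common patterns.

```python
values = ['multi type list', 9, 10.11]
grid = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# Negative indices count from the end of the list
print(values[-1])      # last element: 10.11

# A slice start:stop includes start but excludes stop
print(values[0:2])     # first two elements

# Row 1, column 2 of the nested list (remember that counting starts at 0)
print(grid[1][2])      # 6

# Pull out one "column" by taking the same position from every row
middle_column = [row[1] for row in grid]
print(middle_column)   # [2, 5, 8]

# Lists can grow after they are created
values.append("new item")
print(len(values))     # 4
```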


## Working with Data Frames in Python

### Installing `pandas`

To install a package into our virtual environment, we can run the `pip install` command in a notebook cell like this... 

In [28]:
%pip install pandas

Note: you may need to restart the kernel to use updated packages.


### Some notes about packages in Python

#### Installation
When you run pip in the terminal, it affects the Python environment that is currently active. If you're using a virtual environment as we are, pip will install packages into that environment. If no virtual environment is active, it will install packages into your global Python environment.

%pip is a special command used in Jupyter notebooks. You use it within a Jupyter notebook cell to ensure that the package is installed in the same Python environment the notebook is currently using. This is particularly useful because sometimes the environment a Jupyter notebook runs in isn't the same as your default Python environment, especially if you're using virtual environments.
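
If you are unsure which environment a notebook is using, one way to check is to ask Python itself. This is a minimal sketch using only the standard library; note that the comparison of `sys.prefix` and `sys.base_prefix` is an assumption that holds for environments created with `venv` or `virtualenv`, and it may report `False` for other tools such as conda.

```python
import sys

# Path of the Python interpreter that is running this code
print(sys.executable)

# Root folder of the environment that packages are installed into
print(sys.prefix)

# In a venv-style virtual environment, prefix differs from base_prefix
in_virtual_env = sys.prefix != sys.base_prefix
print("Running inside a virtual environment:", in_virtual_env)
```

Running this in a notebook cell and again in a terminal session is a quick way to see whether the two are pointing at the same environment.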

In [10]:
# Once it is in the virtual environment, we have to import it into this notebook 

import pandas as pd 


Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466
        
  import pandas as pd
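
A message like the one above mentions a package that may or may not be installed. As a small sketch, the standard library's `importlib.metadata` module (available in Python 3.8 and later) can report which version of a package, if any, is present in the current environment. The helper name `installed_version` is made up for this example.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


for name in ["pandas", "pyarrow"]:
    print(name, "->", installed_version(name) or "not installed")
```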


#### Finding documentation

- **Online Documentation:** The most comprehensive source is the official Pandas documentation. It includes a detailed user guide, API reference, and examples.

- **Using help() Function:**

  - For example, for general Pandas documentation: help(pd)
  - For documentation of a specific function (e.g., read_csv): help(pd.read_csv)

- **Accessing Docstrings:**

  - In an interactive Python session (like a Jupyter notebook), you can use:
    - `pd.read_csv?` to view the docstring of read_csv.
    - `pd?` to view the docstring for Pandas.

- **IDE Features:** If you're using an IDE, hovering over a function name often shows a tooltip with documentation. Also, pressing Shift + Tab in Jupyter notebooks shows the docstring for the function.


In [None]:
help(pd)

For specific functions, using the example of `read_csv`:

- `help(pd.read_csv)`

  This is a standard Python function that invokes the built-in help system. When you use `help(pd.read_csv)`, it displays the documentation for the `read_csv` function in a detailed, text-based format. This command is versatile and can be used in any Python environment, including a standard Python shell, scripts, and Jupyter notebooks. The output is typically displayed in the same area where the command was executed.

- `pd.read_csv?`

  This syntax is specific to IPython-based environments such as Jupyter notebooks. 
  When you use `pd.read_csv?` in a Jupyter notebook, it typically displays the documentation in a separate pane at the bottom of the notebook interface, although the exact presentation depends on the front end you are using. This pane can be resized, scrolled, or closed. The display shows the function's signature and docstring, formatted for quick readability.
  This method is more interactive and user-friendly, especially in a Jupyter notebook environment, but it is not available in standard Python shells or scripts.
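
Both approaches ultimately display a function's docstring, and the same text can be retrieved in code. The sketch below uses the standard library's `inspect` module and, as a stand-in, the built-in `json.loads` function so that it runs without any extra packages; the same calls should work on other Python functions as well.

```python
import inspect
import json

# The signature lists the arguments the function accepts
print(inspect.signature(json.loads))

# getdoc() returns the cleaned-up docstring as an ordinary string
doc = inspect.getdoc(json.loads)

# Print only the first line to keep the output short
print(doc.splitlines()[0])
```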



In [None]:
# Uncomment the line below to test. 

# help(pd.read_csv)

 
# Uncomment the line below to test. 

# pd.read_csv?


Now, let's use `read_csv` to read in a new dataset, "colleges.csv". This data file is available on GitHub: https://github.com/kristinporter/DSC201_602/blob/main/data/colleges.csv

A few notes about calling functions: In the example below...

- pandas (aliased as pd) is the package.
- read_csv is the function being called.
- we include the path and file name within ()

In [20]:
# Read in the file 
df = pd.read_csv('data/colleges.csv')

# To get overall distributions of variables in your dataframe 
df.describe()

| | OPEID | median_debt | default_rate | admit_rate | SAT_avg | enrollment | net_price | avg_cost | net_tuition | ed_spending_per_student | avg_faculty_salary | pct_PELL | pct_fed_loan | grad_rate | pct_firstgen | med_fam_income | med_alum_earnings |
|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
| count | 4435.0 | 4435.0 | 4435.0 | 1704.0 | 1105.0 | 4435.0 | 4435.0 | 4435.0 | 4435.0 | 4435.0 | 3077.0 | 4435.0 | 4435.0 | 4435.0 | 4088.0 | 4399.0 | 3912.0 |
| mean | 1492464.0 | 11.19579 | 9.06009 | 70.812576 | 1139.842534 | 3110.519053 | 17.371474 | 27.10288 | 10.836639 | 7.760832 | 7.266518 | 45.55554 | 49.069461 | 54.945651 | 43.357756 | 31.79193 | 40.007157 |
| std | 1976276.0 | 5.319178 | 6.144554 | 20.567925 | 131.630792 | 6429.445325 | 8.638514 | 14.988075 | 7.50641 | 6.881391 | 2.528365 | 20.309775 | 24.542281 | 22.051351 | 12.931312 | 20.811117 | 14.486256 |
| min | 100200.0 | 1.932 | 0.0 | 2.44 | 760.0 | 0.0 | -0.407 | 4.76 | 0.0 | 0.0 | 0.897 | 0.0 | 0.0 | 0.0 | 8.866995 | 0.0 | 10.939 |
| 25% | 282200.0 | 6.863 | 4.4 | 59.7875 | 1050.0 | 171.0 | 10.849 | 16.4525 | 5.4395 | 4.126 | 5.61 | 29.83 | 30.925 | 37.31 | 35.006281 | 17.82775 | 29.72025 |
| 50% | 766900.0 | 9.5 | 8.2 | 74.68 | 1113.0 | 868.0 | 16.757 | 22.945 | 9.912 | 6.352 | 6.958 | 42.5 | 52.54 | 56.4 | 45.102178 | 24.67 | 38.056 |
| 75% | 2362002.0 | 15.0 | 12.3 | 86.115 | 1205.0 | 2953.0 | 22.4705 | 32.0325 | 14.218 | 9.342 | 8.573 | 60.38 | 67.68 | 71.915 | 52.599727 | 39.5165 | 47.38125 |
| max | 72098870.0 | 33.47 | 57.1 | 100.0 | 1566.0 | 109233.0 | 112.05 | 120.377 | 66.442 | 139.766 | 21.143 | 100.0 | 100.0 | 100.0 | 85.90604 | 179.864 | 132.969 |
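
Notice that the summary covers only the numeric columns, and that the `count` row differs across columns because missing values are not counted. The sketch below reproduces both behaviors on a tiny made-up dataset (the rows in `csv_text` are invented and are not taken from colleges.csv), reading it from an in-memory string instead of a file.

```python
import io

import pandas as pd

# Hypothetical data standing in for a CSV file; College B has no SAT average
csv_text = """name,state,admit_rate,SAT_avg
College A,VA,70.5,1100
College B,NC,55.0,
College C,VA,90.1,1250
"""

# read_csv accepts a file path or a file-like object such as StringIO
demo_df = pd.read_csv(io.StringIO(csv_text))

# By default, describe() summarizes numeric columns only
numeric_summary = demo_df.describe()
print(list(numeric_summary.columns))

# The count for SAT_avg is lower because the missing value is skipped
print(numeric_summary.loc["count"])

# include="all" adds the text columns to the summary as well
full_summary = demo_df.describe(include="all")
print(list(full_summary.columns))
```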


In [23]:
# To get the column names of the dataframe 

df.columns

Index(['OPEID', 'name', 'city', 'state', 'region', 'median_debt',
       'default_rate', 'highest_degree', 'ownership', 'locale', 'hbcu',
       'admit_rate', 'SAT_avg', 'online_only', 'enrollment', 'net_price',
       'avg_cost', 'net_tuition', 'ed_spending_per_student',
       'avg_faculty_salary', 'pct_PELL', 'pct_fed_loan', 'grad_rate',
       'pct_firstgen', 'med_fam_income', 'med_alum_earnings'],
      dtype='object')

In [24]:
# To get the dimension of a dataframe 

df.shape

(4435, 26)

In [25]:
# To get a particular column of dataframe 

df['SAT_avg']

0        959.0
1       1245.0
2          NaN
3       1300.0
4        938.0
         ...  
4430       NaN
4431       NaN
4432       NaN
4433       NaN
4434       NaN
Name: SAT_avg, Length: 4435, dtype: float64
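
Selecting one column returns a pandas Series, as shown above. As a brief sketch on a small made-up dataframe (the values are invented for illustration), the same bracket syntax can also select several columns or filter rows.

```python
import pandas as pd

# Hypothetical data; None becomes a missing value (NaN) in the numeric column
demo_df = pd.DataFrame({
    "name": ["College A", "College B", "College C"],
    "state": ["VA", "NC", "VA"],
    "SAT_avg": [1100.0, None, 1250.0],
})

# Single brackets with one name return a Series
one_column = demo_df["SAT_avg"]
print(type(one_column).__name__)

# A list of names inside the brackets returns a smaller DataFrame
two_columns = demo_df[["name", "state"]]
print(type(two_columns).__name__)

# A True/False condition inside the brackets keeps only the matching rows
virginia = demo_df[demo_df["state"] == "VA"]
print(virginia)
```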

In [26]:
# Then you can use the same math functions on that column like in R 

print(min(df['SAT_avg']))
print(max(df['SAT_avg']))

760.0
1566.0
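
One caution when a column contains missing values: Python's built-in `min()` and `max()` compare values one at a time, and comparisons with `NaN` are always false, so the result can depend on where the `NaN` values sit. The sketch below uses a small made-up Series that starts with a missing value to show the difference from the pandas methods `.min()` and `.max()`, which skip missing values by default.

```python
import math

import pandas as pd

# Hypothetical scores with a missing value in the first position
scores = pd.Series([float("nan"), 959.0, 1245.0, 760.0])

# The built-in min() starts from the first value (NaN) and never replaces it
builtin_result = min(scores)
print(builtin_result)

# The pandas methods ignore missing values unless told otherwise
print(scores.min())
print(scores.max())
```

For columns like `SAT_avg`, calling `df['SAT_avg'].min()` and `df['SAT_avg'].max()` is therefore the safer habit.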
