# Practice with Jupyter Notebooks and Basic Python Commands

**Overview:** For this assignment, you will practice working with Jupyter notebooks, navigate documentation for Python packages and functions, and read in the data you have selected for your class project. 

**Directions:** Please work through the notebook, answering all the questions. 
When you are done, export to a PDF file. Upload it to Gradescope using the following naming convention for your file: 

DSC201_602_SP24_codingassignment4_unityID

(For example, DSC201_602_SP24_codingassignment4_keporte2)

As a reminder, you can export to an HTML file within VS Code by clicking on the three dots ("...") to the right of "Outline" in the panel at the top of your screen. Then print your HTML file to a PDF. If you have trouble exporting in VS Code, please upload your file to Jupyter Hub (https://jhub.cos.ncsu.edu/) by clicking on the up arrow over the horizontal line (2 icons to the right of the blue plus button). Then you can export to a PDF by going to "File" then "Save and Export Notebook As." 

**Points:** 25 points (plus 3 extra credit points)

**Due:** February 14 at 11:59 PM 

## The `NumPy` package

`NumPy` is a popular Python package used primarily for numerical calculations.  It is a powerful tool that handles large sets of numbers and mathematical operations efficiently. It's widely used in data science to process and analyze data. `NumPy`'s ability to perform complex calculations quickly makes it a foundational tool for other data science libraries. 

Uncomment and run the command below to install `NumPy`. Note that you will need to restart your kernel after you install it. In VS Code and on Jupyter Hub, there is a restart button in the top panel. You can comment out the installation command after you have successfully installed the package.

In [1]:
#%pip install numpy

Next, we import `numpy`. We shorten the imported name to `np`, a widely adopted convention that keeps code concise and familiar to everyone working on it. 

Then, `help(np)` brings up documentation. Go to the listed link of numpy.org and after opening it, click on "Documentation." Find the answers to the following questions: 

**(1) Similar to how R's `dplyr` has somewhat different versions of the data structures in base R, Python's `numpy` has somewhat different versions of the data structures in standard Python. For example, an "array" is a central data structure in `numpy` which can be compared to a "list" in standard Python. What is at least one difference between an "array" in `numpy` and a "list" in standard Python? [Enter your answer below, 1 POINT]**

- **Standard Python lists:** They are heterogeneous, meaning they can contain elements of different data types within the same list (e.g., integers, strings, objects).
- **NumPy arrays:** These are homogeneous, meaning all elements in a NumPy array must be of the same data type. This constraint allows NumPy to efficiently perform vectorized operations, utilizing contiguous memory allocation and optimized C and Fortran libraries for fast numerical computations.
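
As a minimal sketch of this difference (the variable names are illustrative and not part of the assignment), the snippet below shows a list holding mixed types, an array coercing its elements to a single `dtype`, and the `*` operator behaving differently on each:

```python
import numpy as np

# A standard Python list can mix types freely.
mixed_list = [1, "two", 3.0]
print([type(item).__name__ for item in mixed_list])  # ['int', 'str', 'float']

# A NumPy array stores a single dtype; mixing ints and floats upcasts to float.
coerced_array = np.array([1, 2.5, 3])
print(coerced_array.dtype)  # float64

# Arithmetic on an array is element-wise (vectorized) ...
int_array = np.array([1, 2, 3])
doubled_array = int_array * 2
print(doubled_array)  # [2 4 6]

# ... whereas multiplying a list repeats it.
repeated_list = [1, 2, 3] * 2
print(repeated_list)  # [1, 2, 3, 1, 2, 3]
```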

**(2) In `numpy`, what is the relationship between a vector and an array? What is the relationship between a matrix and an array? [Enter your answer below, 1 POINT]**


A vector in NumPy is essentially a 1D array: a single sequence of numbers, much like a list. (A row or column vector with an explicit orientation is stored as a 2D array of shape (1, n) or (n, 1).)

A matrix in NumPy is a 2D array. It is a rectangular grid of elements arranged in rows and columns. Each row and each column is a vector. 
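
One way to see this relationship is to inspect the `ndim` and `shape` attributes. The following is a small sketch with made-up values:

```python
import numpy as np

vector = np.array([1, 2, 3])                # 1D array
matrix = np.array([[1, 2, 3], [4, 5, 6]])   # 2D array: 2 rows, 3 columns

print(vector.ndim, vector.shape)  # 1 (3,)
print(matrix.ndim, matrix.shape)  # 2 (2, 3)

# Pulling a single row or column out of a matrix gives back a 1D array.
first_row = matrix[0]
second_column = matrix[:, 1]
print(first_row, first_row.ndim)          # [1 2 3] 1
print(second_column, second_column.ndim)  # [2 5] 1
```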


In [2]:
import numpy as np

# uncomment to test, but then recomment so you don't have this long output in your final document
# help(np)

**(3) In the following code chunk, write a line of code to accomplish each task described in the comments. Refer to documentation at numpy.org to guide you; it provides lots of examples that you can copy and modify. [6 POINTS TOTAL, 1 POINT EACH]**

In [3]:
# Create a one-dimensional NumPy array of numbers from 1 to 10.
my_array = np.array([1,2,3,4,5,6,7,8,9,10])
#my_array = np.arange(1, 11) # This also works. 
print(my_array)

# TODO: Print the first element of the array you just created.
print(my_array[0])

# TODO: Find the length of the array.
print(len(my_array))

# TODO: Find and print the mean of your array
print(np.mean(my_array))

# TODO: Reshape your array into a matrix with 2 columns and 5 rows, assign this to a new object and print it (2 lines of code)
my_matrix = my_array.reshape(5,2)
print(my_matrix)

# TODO: Find and print the sum of the first column of your matrix
sum_first_column = np.sum(my_matrix[:, 0])
print(sum_first_column)

[ 1  2  3  4  5  6  7  8  9 10]
1
10
5.5
[[ 1  2]
 [ 3  4]
 [ 5  6]
 [ 7  8]
 [ 9 10]]
25


**(4) A valuable way to expand your coding skills is to run example code - from documentation, tutorials, other coders - and see what it accomplishes. Using the documentation at numpy.org, create a code chunk and test out three NumPy commands that operate on either the array or the matrix that you created above. After running the commands, use comments to explain what each command does and anything you think is important to remember about how the command is used, or something you do not understand. (3 POINTS FOR COMPLETION)**

In [4]:
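# Sum each column of the matrix (axis=0 collapses the rows); not displayed, because a cell only echoes its last expression
my_matrix.sum(axis=0)
# Reshape the 1D array into a 10x1 column; this returns a reshaped array and leaves my_array itself unchanged
my_array.reshape(10,1)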



array([[ 1],
       [ 2],
       [ 3],
       [ 4],
       [ 5],
       [ 6],
       [ 7],
       [ 8],
       [ 9],
       [10]])
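
The commands above rely on the `axis` argument and on `reshape` returning a new array object. A short sketch, rebuilding the same array and matrix so it runs on its own, shows both ideas along with the transpose attribute `.T`:

```python
import numpy as np

my_array = np.arange(1, 11)
my_matrix = my_array.reshape(5, 2)

# axis=0 collapses the rows, giving one sum per column.
column_sums = my_matrix.sum(axis=0)
print(column_sums)  # [25 30]

# axis=1 collapses the columns, giving one sum per row.
row_sums = my_matrix.sum(axis=1)
print(row_sums)  # [ 3  7 11 15 19]

# .T swaps rows and columns.
transposed = my_matrix.T
print(transposed.shape)  # (2, 5)

# reshape did not change the shape of the original array.
print(my_array.shape)  # (10,)
```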

## Getting familiar with `pandas`


NumPy is valuable for computations on arrays and matrices, but we will work more with Pandas, which is built on top of NumPy and is better suited to complex data manipulation on structured data. You should have already installed pandas when running Python-Notebook1. If not, uncomment and run the first line in the chunk below. In either case, run the import command below to load the Pandas library into your environment. 

In [5]:
# %pip install pandas

# Import the pandas library and assign the alias "pd"
import pandas as pd 




#### Reviewing documentation

For `pandas` documentation, check out https://pandas.pydata.org/docs/, which is much more user-friendly than what you get with `help(pd)`. Review the "package overview" and answer the following questions:

**(5) What are at least two things you can do with the `pandas` package that accomplish similar (or identical) tasks you learned with `dplyr` in R? [1 POINT]**

1. **Filtering Data**:
   - In `dplyr`, you use `filter()` to select rows based on a condition. For example: `filter(df, column > 5)`.
   - In `pandas`, you can achieve this with boolean indexing or the `.query()` method. For example: `df[df['column'] > 5]` or `df.query("column > 5")`.

2. **Selecting Columns**:
   - `dplyr` uses `select()` to choose specific columns from a dataframe. For example: `select(df, column1, column2)`.
   - In `pandas`, you can select columns by passing a list of column names. For example: `df[['column1', 'column2']]`.

3. **Creating or Transforming Columns**:
   - In `dplyr`, `mutate()` is used to create new columns or modify existing ones. For example: `mutate(df, new_column = existing_column + 10)`.
   - In `pandas`, you can directly create or modify columns using assignment. For example: `df['new_column'] = df['existing_column'] + 10`.

4. **Grouping and Summarizing Data**:
   - `dplyr` uses `group_by()` in combination with `summarise()` (or `summarize()`) to group data and then apply summary functions. For example: `df %>% group_by(group_column) %>% summarise(mean_value = mean(value_column))`.
   - In `pandas`, this is achieved using `.groupby()` followed by aggregation methods like `.mean()`. For example: `df.groupby('group_column')['value_column'].mean()`.

5. **Sorting Data**:
   - `dplyr` provides `arrange()` for sorting data frames. For example: `arrange(df, column)`.
   - In `pandas`, the equivalent is the `.sort_values()` method. For example: `df.sort_values(by='column')`.
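
The five pandas idioms above can be tried together on a small made-up data frame. This is only a sketch: the `toy` frame and its column names are hypothetical, and `.assign()` is used in place of direct column assignment so the original frame is left untouched.

```python
import pandas as pd

toy = pd.DataFrame({
    "group": ["a", "b", "a", "b"],
    "value": [10, 20, 30, 40],
})

# filter(): boolean indexing and .query() select the same rows.
filtered = toy[toy["value"] > 15]
filtered_query = toy.query("value > 15")

# select(): pass a list of column names.
selected = toy[["group"]]

# mutate(): .assign() returns a new frame with the extra column.
mutated = toy.assign(value_plus_10=toy["value"] + 10)

# group_by() + summarise(): group, pick a column, aggregate.
group_means = toy.groupby("group")["value"].mean()

# arrange(): sort by a column (descending here).
sorted_desc = toy.sort_values(by="value", ascending=False)

print(filtered)
print(group_means)
print(sorted_desc)
```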


**(6) What are the two main data structures in `pandas` and what are the parallel data structures in base R or `dplyr`? [1 POINT]**

In `pandas`, the two main data structures are:

1. **DataFrame**: 
   - This is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns). It's akin to a spreadsheet or SQL table.
   - **Parallel in R**: The equivalent in base R is a `data.frame`. In `dplyr`, the closest parallel is a tibble (`tbl_df`), which is a modern reimagining of the data frame.

2. **Series**: 
   - A Series is a one-dimensional array-like object containing a sequence of values and an associated array of data labels, called its index. It's similar to a single column in a spreadsheet.
   - **Parallel in R**: The closest equivalent to a `Series` in R is a `vector`. 
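
A brief sketch of how the two structures relate, using a made-up frame: selecting one column with single brackets returns a `Series`, while double brackets return a one-column `DataFrame`.

```python
import pandas as pd

frame = pd.DataFrame({"ColA": [1, 2, 3], "ColB": [0.1, 0.2, 0.3]})

single_column = frame["ColA"]       # single brackets -> Series
one_column_frame = frame[["ColA"]]  # double brackets -> DataFrame

print(type(single_column).__name__)     # Series
print(type(one_column_frame).__name__)  # DataFrame

# A Series carries an index (row labels) alongside its values.
print(single_column.index.tolist())  # [0, 1, 2]
print(single_column.tolist())        # [1, 2, 3]
```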


**(7) Within this Pandas website, go to https://pandas.pydata.org/docs/getting_started/comparison/comparison_with_r.html#quick-reference. Use this "quick reference" to complete the following. [4 POINTS TOTAL, 1 POINT EACH]**

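Note: Top-level functions are preceded by `pandas` and a "." or, more commonly and preferably, by the alias you used when you imported it (`pd`), as in `pd.functionname`. Methods and attributes of a data frame are instead called on the object itself, as in `df.shape`.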


In [6]:
# First, I will create a toy data frame for you. 
df = pd.DataFrame({
    'ColA': [1, 2, 3],
    'ColB': [0.1, 0.2, 0.3],
    'ColC': [True, False, True]
})
print(df)

# TODO: Find the dimensions of the data frame df
df_dimensions = df.shape
print("Dimensions of DataFrame:", df_dimensions)

# TODO: Summarize all columns in the data frame df (Hint: for this one the () will be empty and you need to include "print()")
summary = df.describe(include='all')
print("Summary of all columns:\n", summary)

# TODO: Create a reduced data frame (a new object) that is just the first two columns
reduced_df = df[['ColA', 'ColB']]
print("Reduced DataFrame:\n", reduced_df)

# TODO: Run another command of your choice and add a comment describing what it does. Be sure to print the results. 
# Counting the number of observations with ColC equal to True (True counts as 1 when summed)
true_count = df['ColC'].sum()
print("Number of True values in 'ColC':", true_count)


   ColA  ColB   ColC
0     1   0.1   True
1     2   0.2  False
2     3   0.3   True
Dimensions of DataFrame: (3, 3)
Summary of all columns:
         ColA  ColB  ColC
count    3.0  3.00     3
unique   NaN   NaN     2
top      NaN   NaN  True
freq     NaN   NaN     2
mean     2.0  0.20   NaN
std      1.0  0.10   NaN
min      1.0  0.10   NaN
25%      1.5  0.15   NaN
50%      2.0  0.20   NaN
75%      2.5  0.25   NaN
max      3.0  0.30   NaN
Reduced DataFrame:
    ColA  ColB
0     1   0.1
1     2   0.2
2     3   0.3
Number of True values in 'ColC': 2
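
The `NaN` entries in the summary above arise because `include='all'` mixes numeric and non-numeric statistics in one table. As a sketch of the difference from the default call, the example below uses a hypothetical frame with a text column standing in for `ColC`:

```python
import pandas as pd

toy = pd.DataFrame({
    "ColA": [1, 2, 3],
    "ColB": [0.1, 0.2, 0.3],
    "ColC": ["x", "y", "x"],  # text column, assumed here for illustration
})

# By default, describe() summarizes only the numeric columns.
numeric_summary = toy.describe()
print(list(numeric_summary.columns))  # ['ColA', 'ColB']

# include='all' adds the non-numeric columns and rows such as 'unique' and 'top'.
full_summary = toy.describe(include="all")
print(list(full_summary.columns))  # ['ColA', 'ColB', 'ColC']
print(full_summary.loc["unique", "ColC"])  # 2
```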


For specific functions within a package, it can be quick and helpful to pull up documentation from inside your Jupyter notebook. Using the example of `read_csv`:
 
- `help(pd.read_csv)`

  This is a standard Python function that invokes the built-in help system. `help(pd.read_csv)` displays the full documentation for the `read_csv` function as plain text. It works in any Python environment, including a standard Python shell, scripts, and Jupyter notebooks. The output typically appears in the same area where the command was executed.

- `pd.read_csv?`

  This syntax comes from IPython, the kernel behind Jupyter notebooks, so it is not available in standard Python shells or scripts.
  In the Jupyter interface, `pd.read_csv?` typically displays the documentation in a separate pane at the bottom of the notebook, which can be resized, scrolled, or closed; other front ends may show it in the cell output instead. The display leads with the function signature, followed by the docstring.
  This method is more interactive and convenient for quick lookups while you work in a notebook.
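
Both commands print the full docstring, which is long for `read_csv`. If you only need a quick look, one option (an alternative not covered in the assignment) is the standard-library `inspect` module, sketched here:

```python
import inspect
import pandas as pd

# Fetch the cleaned docstring and keep only its first few lines.
doc = inspect.getdoc(pd.read_csv)
first_lines = doc.splitlines()[:3]
print("\n".join(first_lines))

# List the first few parameter names from the function signature.
signature = inspect.signature(pd.read_csv)
parameter_names = list(signature.parameters)
print(parameter_names[:5])
```

This works in any Python environment, not only in notebooks.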

**(8) Give these a try below, and provide 2 observations for each: [Enter responses below within this markdown chunk, 4 POINTS]**

In [7]:
# TODO: Uncomment the line below to test. 

# help(pd.read_csv)

 
# TODO: Uncomment the line below to test. 

#pd.read_csv?


**(9) Now, let's use `read_csv` to read in the datasets you have selected for your project. I am assuming you have .csv files. If not, then write a .csv file in R so that you can read it in here. After reading in your file, carry out the tasks described in the comments. [4 POINTS TOTAL, 1 POINT EACH]**

In [8]:
# Read in the file by editing the example below
projectData = pd.read_csv('data/colleges.csv')

# TODO: Display the dimensions of your data frame projectData
projectData_dimensions = projectData.shape
print("Dimensions of DataFrame:", projectData_dimensions)

# TODO: Summarize (describe) the columns in projectData
summaryProject = projectData.describe(include='all')
print("Summary of all columns:\n", summaryProject)

# TODO: Display the first 10 rows of projectData and at the same time, 5 columns of your choice
selectedCols = ['name','state','admit_rate','SAT_avg','grad_rate']
print(projectData[selectedCols].head(10))


Dimensions of DataFrame: (4435, 26)
Summary of all columns:
                OPEID               name      city state   region  median_debt  \
count   4.435000e+03               4435      4435  4435     4435  4435.000000   
unique           NaN               4357      1943    54        7          NaN   
top              NaN  Cortiva Institute  New York    CA  Midwest          NaN   
freq             NaN                  6        51   423     1074          NaN   
mean    1.492464e+06                NaN       NaN   NaN      NaN    11.195790   
std     1.976276e+06                NaN       NaN   NaN      NaN     5.319178   
min     1.002000e+05                NaN       NaN   NaN      NaN     1.932000   
25%     2.822000e+05                NaN       NaN   NaN      NaN     6.863000   
50%     7.669000e+05                NaN       NaN   NaN      NaN     9.500000   
75%     2.362002e+06                NaN       NaN   NaN      NaN    15.000000   
max     7.209887e+07                NaN       Na
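
To experiment with `read_csv` without a file on disk, you can pass it an in-memory text buffer. The sketch below uses made-up rows (the college names and values are hypothetical) and shows that empty fields are read in as `NaN`:

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file such as 'data/colleges.csv'.
csv_text = """name,state,admit_rate,SAT_avg
College A,NC,45.0,1200
College B,VA,,1100
College C,NC,80.5,
"""

toy_data = pd.read_csv(io.StringIO(csv_text))

print(toy_data.shape)  # (3, 4)

# Count missing values per column; empty CSV fields become NaN.
missing_counts = toy_data.isna().sum()
print(missing_counts)

# Select columns and limit rows in one expression.
print(toy_data[["name", "admit_rate"]].head(2))
```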

**Extra Credit:**

**(1) Create a new object to hold a modified version of `projectData`. The modified version should (a) rename one variable, (b) drop one variable and (c) create one new variable that is a computation based on another variable in your data (e.g., it sums two variables together or it multiplies a variable by a number, etc.). [1 POINT]**

**(2) Describe what you would do to check that your modifications worked. How would you make it easy to do the checks (rather than viewing lots of information you don't need)? [1 POINT]**

**(3) Write code to implement your check [1 POINT]**


In [9]:
projectData_mod = projectData.copy()

# 1. Rename a variable/column
projectData_mod = projectData_mod.rename(columns={'OPEID': 'collegeID'})

# 2. Create a new variable/column based on a calculation of an existing column
projectData_mod['admit_rate_01'] = projectData_mod['admit_rate'] /100

# 3. Drop a column
projectData_mod = projectData_mod.drop('admit_rate', axis=1)

# Checking with a print of columns involved
selCol_orig = projectData[['OPEID','admit_rate']]
selCol_mod = projectData_mod[['collegeID','admit_rate_01']]
checkdf = pd.concat([selCol_orig,selCol_mod],axis=1)
print(checkdf.head(10))

# Also making sure OPEID and admit_rate are gone
print(list(projectData_mod.columns))

# Better yet, so I don't have to trust my reading of all the col names:
# Returns True if the column name listed is in the column names, False otherwise
print('admit_rate' in projectData_mod.columns)
print('OPEID' in projectData_mod.columns)


     OPEID  admit_rate  collegeID  admit_rate_01
0   100200       89.65     100200         0.8965
1   105200       80.60     105200         0.8060
2  2503400         NaN    2503400            NaN
3   105500       77.11     105500         0.7711
4   100500       98.88     100500         0.9888
5   105100       80.39     105100         0.8039
6   100700         NaN     100700            NaN
7   831000       95.55     831000         0.9555
8   100900       85.07     100900         0.8507
9   101200       60.45     101200         0.6045
['collegeID', 'name', 'city', 'state', 'region', 'median_debt', 'default_rate', 'highest_degree', 'ownership', 'locale', 'hbcu', 'SAT_avg', 'online_only', 'enrollment', 'net_price', 'avg_cost', 'net_tuition', 'ed_spending_per_student', 'avg_faculty_salary', 'pct_PELL', 'pct_fed_loan', 'grad_rate', 'pct_firstgen', 'med_fam_income', 'med_alum_earnings', 'admit_rate_01']
False
False
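
An alternative to reading printed output is to write the checks as assertions, which stay silent when everything is correct and raise an error otherwise. The following is a sketch on a two-row stand-in for the project data (the values are copied from the first rows printed above; the `toy` names are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    "OPEID": [100200, 105200],
    "admit_rate": [89.65, 80.60],
})

# Same three modifications, chained: rename, create, drop.
toy_mod = (
    toy.rename(columns={"OPEID": "collegeID"})
       .assign(admit_rate_01=lambda d: d["admit_rate"] / 100)
       .drop(columns="admit_rate")
)

# Each assert raises AssertionError if the condition is False.
assert "collegeID" in toy_mod.columns
assert "OPEID" not in toy_mod.columns
assert "admit_rate" not in toy_mod.columns

# Compare the new column against the expected computation, ignoring the Series name.
pd.testing.assert_series_equal(
    toy_mod["admit_rate_01"], toy["admit_rate"] / 100, check_names=False
)
print("All checks passed")
```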
