# Practice with Jupyter Notebooks and Basic Python Commands

**Overview:** For this assignment, you will practice working with Jupyter notebooks, navigate documentation for Python packages and functions, and read in the data you have selected for your class project. 

**Directions:** Please work through the notebook, answering all the questions. 
When you are done, export to a PDF file. Upload it to Gradescope using the following naming convention for your file: 

DSC201_602_SP24_codingassignment4_unityID

(For example, DSC201_602_SP24_codingassignment4_keporte2)

As a reminder, you can export to an HTML file within VS Code by clicking on the three dots ("...") to the right of "Outline" in the panel at the top of your screen. Then print your HTML file to a PDF. If you have trouble exporting in VS Code, please upload your file to Jupyter Hub (https://jhub.cos.ncsu.edu/) by clicking on the up arrow over the horizontal line (2 icons to the right of the blue plus button). Then you can export to a PDF by going to "File" then "Save and Export Notebook As." 

**Points:** 25 points (plus 3 extra credit points)

**Due:** February 14 at 11:59 PM 

## The `NumPy` package

`NumPy` is a popular Python package used primarily for numerical calculations.  It is a powerful tool that handles large sets of numbers and mathematical operations efficiently. It's widely used in data science to process and analyze data. `NumPy`'s ability to perform complex calculations quickly makes it a foundational tool for other data science libraries. 

Uncomment and run the command below to install `NumPy`. Note that you will need to restart your kernel after you install it. In VS Code and on Jupyter Hub, there is a restart button in the top panel. You can comment out the installation command after you have successfully installed the package.

In [1]:
%pip install numpy

Note: you may need to restart the kernel to use updated packages.


Next, we import `numpy` and shorten the imported name to `np`. This widely adopted convention keeps code that uses NumPy concise and readable for everyone working on it. 

Then, `help(np)` brings up documentation. Go to the listed link of numpy.org and after opening it, click on "Documentation." Find the answers to the following questions: 

**(1) Similar to how R's `dplyr` has somewhat different versions of the data structures in base R, Python's `numpy` has somewhat different versions of the data structures in standard Python. For example, an "array" is a central data structure in `numpy` which can be compared to a "list" in standard Python. What is at least one difference between an "array" in `numpy` and a "list" in standard Python? [Enter your answer below, 1 POINT]**

**(2) In `numpy`, what is the relationship between a vector and an array? What is the relationship between a matrix and an array? [Enter your answer below, 1 POINT]**

In [2]:
import numpy as np
# help(np)
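As a starting point for questions (1) and (2), the following minimal sketch contrasts a standard Python list with a NumPy array. The variable names are illustrative only, and the sketch covers just a few of the differences described in the NumPy documentation.

```python
import numpy as np

py_list = [1, 2, 3]
np_array = np.array([1, 2, 3])

# Multiplying a list repeats it; multiplying an array scales each element.
list_times_two = py_list * 2
array_times_two = np_array * 2

# A list can mix types freely; an array stores every element with one dtype.
mixed = np.array([1, 2.5, 3])  # the integers are stored as floats

# A vector is a 1-D array and a matrix is a 2-D array.
vector = np.array([1, 2, 3])
matrix = np.array([[1, 2], [3, 4]])

print(list_times_two)
print(array_times_two)
print(mixed.dtype, vector.ndim, matrix.ndim)
```

The `ndim` attribute reports the number of dimensions, which is one way to see that vectors and matrices are both arrays.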

**(3) In the following code chunk, write a line of code to accomplish each task described in the comments. Refer to documentation at numpy.org to guide you; it provides lots of examples that you can copy and modify. [6 POINTS TOTAL, 1 POINT EACH]**

In [3]:
# Create a one-dimensional NumPy array of numbers from 1 to 10.
numpy_1d = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]) # np.array() converts the bracketed Python list into a NumPy array

# TODO: Print the first element of the array you just created.
print(numpy_1d[0])

# TODO: Find the length of the array.
print(len(numpy_1d))

# TODO: Find and print the mean of your array
print(numpy_1d.mean())

# TODO: Reshape your array into a matrix with 2 columns and 5 rows, assign this to a new object and print it (2 lines of code)
reshaped_numpy_1d = numpy_1d.reshape((5, 2))
print(reshaped_numpy_1d)

# TODO: Find and print the sum of the first column of your matrix
print(reshaped_numpy_1d.sum(axis = 0)[0]) 

# axis = 0 is summing for columns, axis = 1 is summing for rows; then we take the first value of that array [25, not 30]

1
10
5.5
[[ 1  2]
 [ 3  4]
 [ 5  6]
 [ 7  8]
 [ 9 10]]
25
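The `axis` argument used above is easy to mix up, so here is a small sketch that shows both directions on the same 5-by-2 layout. It uses `np.arange` to build the values, which is one of several ways to create this array, and it also shows slicing the column first as an alternative route to the same answer.

```python
import numpy as np

matrix = np.arange(1, 11).reshape((5, 2))  # values 1..10 in 5 rows, 2 columns

column_sums = matrix.sum(axis=0)       # collapses the rows: one sum per column
row_sums = matrix.sum(axis=1)          # collapses the columns: one sum per row
first_column_sum = matrix[:, 0].sum()  # slice the first column, then sum it

print(column_sums)
print(row_sums)
print(first_column_sum)
```

Both `matrix.sum(axis=0)[0]` and `matrix[:, 0].sum()` give the first column's total; the second form avoids computing sums for columns you do not need.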


**(4) A valuable way to expand your coding skills is to run example code - from documentation, tutorials, other coders - and see what it accomplishes. Using the documentation at numpy.org, create a code chunk and test out three NumPy commands that operate on either the array or the matrix that you created above. After running the commands, use comments to explain what each command does and anything you think is important to remember about how the command is used, or something you do not understand. (3 POINTS FOR COMPLETION)**

In [4]:
numpy_1d = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

LCM_reshaped_numpy_1d = np.lcm.reduce(numpy_1d) # first function: least common multiple of all elements
print(LCM_reshaped_numpy_1d) # reduces the 1-D array to a single NumPy scalar, so it prints without brackets

GCD_reshaped_numpy_1d = np.gcd.reduce(numpy_1d) # second function: greatest common divisor of all elements
print(GCD_reshaped_numpy_1d) # again a single scalar rather than an array

mean_reshaped_numpy1d = np.mean(numpy_1d) # third function: arithmetic mean
print(mean_reshaped_numpy1d) # a single floating-point scalar

2520
1
5.5
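The `.reduce` method used above belongs to NumPy's universal functions (ufuncs) and repeatedly applies a two-argument function across an array until one value is left. The sketch below, using an illustrative `values` array, shows that `np.add.reduce` matches the ordinary sum and that the reduced result has zero dimensions.

```python
import numpy as np

values = np.arange(1, 11)  # 1 through 10

total = np.add.reduce(values)  # applies addition pairwise across the array
lcm = np.lcm.reduce(values)    # least common multiple of all elements
gcd = np.gcd.reduce(values)    # greatest common divisor of all elements

print(total, lcm, gcd)
print(np.ndim(lcm))  # number of dimensions of the reduced result
```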


## Getting familiar with `pandas`


While NumPy is valuable for computations on arrays and matrices, we will work more with Pandas, which is built on top of NumPy and is better suited to complex data manipulation tasks on structured data. You should have already installed pandas when running Python-Notebook1. If not, uncomment and run the first line in the chunk below. In either case, run the import command below to load the Pandas library into your environment. 

In [5]:
# %pip install pandas

# Import the pandas library and assign the alias "pd"
import pandas as pd 







#### Reviewing documentation

For `pandas` documentation, check out https://pandas.pydata.org/docs/, which is much more user-friendly than what you get with `help(pd)`. Review the "package overview" and answer the following questions:

**(5) What are at least two things you can do with the `pandas` package that accomplish similar (or identical) tasks you learned with `dplyr` in R? [1 POINT]**

- Grouping/Aggregating -> dplyr: group_by() %>% summarize() + Pandas: df.groupby().agg
- Selecting Columns -> dplyr: select() + Pandas: df[['col1']]

**(6) What are the two main data structures in `pandas` and what are the parallel data structures in base R or `dplyr`? [1 POINT]**

- The main data structures in `pandas` are Series (similar to vectors) and DataFrames. The parallel structures in base R are vectors (similar to Series) and data frames. dplyr works with data frames and tibbles (a user-friendly version of data frames).
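To make the relationship between the two structures concrete, here is a minimal sketch with a made-up `people` data frame (the data and names are hypothetical). It shows that pulling out one column yields a Series, while selecting with a list of column names keeps a DataFrame.

```python
import pandas as pd

# Hypothetical toy data for illustration
people = pd.DataFrame({"age": [30, 45, 60], "name": ["A", "B", "C"]})

one_column = people["age"]   # single brackets return a Series
sub_frame = people[["age"]]  # a list of names returns a DataFrame

print(type(one_column).__name__)
print(type(sub_frame).__name__)
print(sub_frame.shape)
```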


**(7) Within this Pandas website, go to https://pandas.pydata.org/docs/getting_started/comparison/comparison_with_r.html#quick-reference. Use this "quick reference" to complete the following. [4 POINTS TOTAL, 1 POINT EACH]**

Note: The functions are preceded by `pandas` and a "." or more frequently and preferably, the alias you used when you imported it (pd). So "pd.functionname."


In [6]:
# First, I will create a toy data frame for you. 
df = pd.DataFrame({
    'ColA': [1, 2, 3],
    'ColB': [0.1, 0.2, 0.3],
    'ColC': [True, False, True]
})
print(df)

# TODO: Find the dimensions of the data frame df
num_rows, num_cols = df.shape

print(f"Number of Rows: {num_rows}")
print(f"Number of Columns: {num_cols}")

# TODO: Summarize all columns in the data frame df (Hint: for this one the () will be empty and you need to include "print()")
summary_stats = df.describe(include="all")
print(summary_stats)

# TODO: Create a reduced data frame (a new object) that is just the first two columns
reduced_df = df.iloc[:, 0:2] # ":" keeps every row; "0:2" keeps the columns at positions 0 and 1
print(reduced_df)

# TODO: Run another command of your choice and add a comment describing what it does. Be sure to print the results. 
print(df.head(1)) # this code prints the first row of the data frame

   ColA  ColB   ColC
0     1   0.1   True
1     2   0.2  False
2     3   0.3   True
Number of Rows: 3
Number of Columns: 3
        ColA  ColB  ColC
count    3.0  3.00     3
unique   NaN   NaN     2
top      NaN   NaN  True
freq     NaN   NaN     2
mean     2.0  0.20   NaN
std      1.0  0.10   NaN
min      1.0  0.10   NaN
25%      1.5  0.15   NaN
50%      2.0  0.20   NaN
75%      2.5  0.25   NaN
max      3.0  0.30   NaN
   ColA  ColB
0     1   0.1
1     2   0.2
2     3   0.3
   ColA  ColB  ColC
0     1   0.1  True
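Selecting columns by position with `.iloc` is only one option. As a sketch on the same toy data frame, the code below selects the same two columns by name with `.loc`, and also shows what `describe()` returns when `include="all"` is left out.

```python
import pandas as pd

df = pd.DataFrame({
    'ColA': [1, 2, 3],
    'ColB': [0.1, 0.2, 0.3],
    'ColC': [True, False, True]
})

by_position = df.iloc[:, 0:2]             # all rows, columns at positions 0 and 1
by_label = df.loc[:, ['ColA', 'ColB']]    # all rows, columns chosen by name
numeric_only = df.describe()              # default summary skips the boolean column

print(by_position.equals(by_label))
print(list(numeric_only.columns))
```

Selecting by name tends to be more robust if the column order of a data file changes later.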


For specific functions within a package, it can be quick and helpful to pull up documentation from inside your Jupyter notebook. Using the example of `read_csv`:

- help(pd.read_csv)

  This is a standard Python function that invokes the built-in help system. When you use help(pd.read_csv), it displays the documentation for the read_csv function in a more detailed, text-based format. This command is versatile and can be used in any Python environment, including a standard Python shell, scripts, and Jupyter notebooks. The output is typically displayed in the same area where the command was executed.

- pd.read_csv?

  This syntax comes from IPython, the kernel behind Jupyter notebooks, and is not valid in plain Python.
  When you use pd.read_csv? in a Jupyter notebook, it displays the documentation in a separate pane or window at the bottom of the notebook interface. This pane can be resized, scrolled, or closed. The documentation displayed is generally more concise and is formatted for quick readability, focusing on the most essential aspects of the function.
  This method is more interactive and user-friendly, especially in a Jupyter notebook environment, but it is not available in standard Python shells or scripts.
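Because the full help text for `read_csv` is long, it can be handy to capture it as a string and look at only the first few lines. The sketch below uses the standard-library `pydoc` module, which is what `help()` relies on; the choice to show five lines is arbitrary.

```python
import pydoc
import pandas as pd

# Render the same text that help(pd.read_csv) would display, but as a string
doc_text = pydoc.render_doc(pd.read_csv, renderer=pydoc.plaintext)

# Show only the first few lines instead of the whole page
first_lines = doc_text.splitlines()[:5]
print("\n".join(first_lines))

# The raw docstring is also available directly on the function
print(len(pd.read_csv.__doc__))
```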

**(8) Give these a try below, and provide 2 observations for each: [Enter responses below within this markdown chunk, 4 POINTS]**

In [7]:
# TODO: Uncomment the line below to test. 
# help(pd.read_csv) # This line of code tells you what the function in parentheses does. It also gives an example.

# TODO: Uncomment the line below to test. 
# pd.read_csv? # This line of code will tell you all possible functionalities for the read_csv function. It also gives you an example of using the code.


**(9) Now, let's use `read_csv` to read in the datasets you have selected for your project. I am assuming you have .csv files. If not, then write a .csv file in R so that you can read it in here. After reading in your file, carry out the tasks described in the comments. [4 POINTS TOTAL, 1 POINT EACH]**

In [15]:
# Read in the file by editing the example below
# projectData = pd.read_csv('data/colleges.csv')
projectData = pd.read_csv('data/comma_survey.csv') 

# TODO: Display the dimensions of your data frame projectData
num_rows, num_cols = projectData.shape

print(f"Number of Rows: {num_rows}")
print(f"Number of Columns: {num_cols}")

# TODO: Summarize (describe) the columns in projectData
summary_projData = projectData.describe(include="all")
print(summary_projData)

# TODO: Display the first 10 rows of ProjectData and at the same time, 5 columns of your choice
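# .iloc[:10, :5] selects the first 10 rows and the first 5 columns by position (the "." chains operations, much like %>% in R).
# Note: this slices the summary table; projectData.iloc[:10, :5] would slice the raw data instead.
summary_projData.iloc[:10,:5]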


Number of Rows: 1129
Number of Columns: 14


         Unnamed: 0  respondent_id  gender    age   household_income  \
count   1129.000000   1.129000e+03    1037   1037                836   
unique          NaN            NaN       2      4                  5   
top             NaN            NaN  Female  45-60  $50,000 - $99,999   
freq            NaN            NaN     548    290                290   
mean     565.000000   3.290127e+09     NaN    NaN                NaN   
std      326.058533   1.072966e+06     NaN    NaN                NaN   
min        1.000000   3.288376e+09     NaN    NaN                NaN   
25%      283.000000   3.289470e+09     NaN    NaN                NaN   
50%      565.000000   3.290114e+09     NaN    NaN                NaN   
75%      847.000000   3.290777e+09     NaN    NaN                NaN   
max     1129.000000   3.292954e+09     NaN    NaN                NaN   

              education location  \
count              1026     1027   
unique                5        9   
top     Bachelor degree  Pa

        Unnamed: 0  respondent_id  gender    age   household_income
count       1129.0         1129.0    1037   1037                836
unique         NaN            NaN       2      4                  5
top            NaN            NaN  Female  45-60  $50,000 - $99,999
freq           NaN            NaN     548    290                290
mean         565.0   3290127000.0     NaN    NaN                NaN
std     326.058533      1072966.0     NaN    NaN                NaN
min            1.0   3288376000.0     NaN    NaN                NaN
25%          283.0   3289470000.0     NaN    NaN                NaN
50%          565.0   3290114000.0     NaN    NaN                NaN
75%          847.0   3290777000.0     NaN    NaN                NaN
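The table above comes from slicing the summary produced by `describe()`, which is different from slicing the data itself. The following sketch uses a tiny, made-up CSV held in memory (via `io.StringIO`, so no file is needed) to show the difference; the column names and values are hypothetical.

```python
import io
import pandas as pd

# Hypothetical stand-in for a project CSV file
csv_text = """respondent_id,gender,age
1,Male,30-44
2,Female,18-29
3,Female,45-60
"""

survey = pd.read_csv(io.StringIO(csv_text))
summary = survey.describe(include="all")

raw_rows = survey.iloc[:2, :2]       # first 2 rows and 2 columns of the data
summary_rows = summary.iloc[:2, :2]  # first 2 rows and 2 columns of the summary

print(survey.shape)
print(raw_rows)
print(list(summary_rows.index))
```

The row labels make the difference visible: the raw slice is indexed by row number, while the summary slice is indexed by statistic names such as `count`.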


**Extra Credit:**

**(1) Create a new object to hold a modified version of `projectData`. The modified version should (a) rename one variable, (b) drop one variable and (c) create one new variable that is a computation based on another variable in your data (e.g., it sums two variables together or it multiplies a variable by a number, etc.). [1 POINT]**

**(2) Describe what you would do to check that your modifications worked. How would you make it easy to do the checks (rather than viewing lots of information you don't need)? [1 POINT]**
- I can use df.head() and df.iloc[] to view only selected rows and columns, or to slice the data according to different specifications.
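One way to keep such checks compact is to compare column names before and after the change and then print only the affected columns. The sketch below does this on hypothetical toy data (the `original` frame and its column names are invented for illustration), so it is one possible approach rather than the required one.

```python
import pandas as pd

# Hypothetical toy data standing in for projectData
original = pd.DataFrame({
    "row_id": [1, 2, 3],
    "gender": ["Male", "Female", "Female"],
    "age": ["30-44", "18-29", "45-60"],
})

modified = (
    original
    .rename(columns={"age": "age_group"})   # (a) rename one variable
    .drop(columns=["row_id"])               # (b) drop one variable
    .assign(genderAge=lambda d: d["gender"] + " " + d["age_group"])  # (c) new variable
)

# Compare column names instead of printing the whole data frame
removed = sorted(set(original.columns) - set(modified.columns))
added = sorted(set(modified.columns) - set(original.columns))

print("Removed or renamed:", removed)
print("Added or renamed:", added)
print(modified[["gender", "age_group", "genderAge"]].head(3))
```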

**(3) Write code to implement your check [1 POINT]**


In [17]:
# PART ONE
mod_projectData = (
    projectData
    .rename(columns={'household_income': 'household income',
                     'respondent_id': 'respondent id'})
    .drop(columns=['Unnamed: 0'])
    # join gender and age with a space between them, e.g. "Male 30-44"
    .assign(genderAge=projectData['gender'] + " " + projectData['age'])
)
print(mod_projectData)

# PART THREE
print(mod_projectData.head(5))


      respondent id  gender    age     household income  \
0        3292953864    Male  30-44    $50,000 - $99,999   
1        3292950324    Male  30-44    $50,000 - $99,999   
2        3292942669    Male  30-44                  NaN   
3        3292932796    Male  18-29                  NaN   
4        3292932522     NaN    NaN                  NaN   
...             ...     ...    ...                  ...   
1124     3288387618  Female  18-29  $100,000 - $149,999   
1125     3288387379  Female  30-44    $50,000 - $99,999   
1126     3288382543  Female  30-44    $50,000 - $99,999   
1127     3288379152  Female  45-60    $50,000 - $99,999   
1128     3288375700    Male  18-29    $25,000 - $49,999   

                             education            location  \
0                      Bachelor degree      South Atlantic   
1                      Graduate degree            Mountain   
2                                  NaN  East North Central   
3         Less than high school degree     