## Homework 1: Importing Datasets and Playing with Pandas

Data Science focuses on using tools to access, analyze, and find meaningful results in data.  Now that we know some basic Python programming, we are ready to begin using our first dataset!

### CSV File Format

One widely used file format for storing datasets is **CSV**, or **Comma-Separated Values**.  A CSV file maps directly between a spreadsheet program (ex: Microsoft Excel, Google Sheets) and programming languages.  Specifically:

- Every row on a spreadsheet is a line in a CSV file
- Every column on a spreadsheet is separated by a comma in a CSV file
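To make the row and column mapping concrete, here is a minimal sketch that uses Python's built-in `csv` module on a small piece of CSV text.  The variable names (`csv_text`, `rows`) and the three-column sample are illustrative assumptions, not the full course catalog file.

```python
import csv
import io

# Illustrative CSV text: one header line plus two data lines.
# (A small made-up sample in the spirit of the course catalog.)
csv_text = "Year,Subject,Title\n2019,CS,Data Structures\n2019,CS,Computer Architecture\n"

# io.StringIO lets us treat the text like an open file.
# csv.reader splits every line into a list of cell values at the commas.
rows = list(csv.reader(io.StringIO(csv_text)))

# Each printed list corresponds to one spreadsheet row;
# each item in the list corresponds to one spreadsheet column.
for row in rows:
    print(row)
```

Note that the `csv` module reads every cell as text, so `2019` comes back as the string `'2019'`.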

### Download the UIUC Course Catalog as a CSV

Let's grab the first dataset we will use!  For this dataset, we have already collected and put the data into a Google Sheet for you.  You will need to download the file as a CSV and place it in the same directory as this Python notebook.

1. Open the Google Sheet for the "UIUC Course Catalog Dataset": https://docs.google.com/spreadsheets/d/1Rib9J9ty_ChZWaszuO6dnPG3Lc1Muxq_G4HnpD1UeeA

2. Choose **File** -> **Download As** -> **Comma-separated Values (.csv, current sheet)**

3. Find the downloaded file on your computer and move it into the same directory as this notebook.
  * The file is likely in your Downloads as `UIUC Course Catalog Dataset - Sheet1.csv`
  * Your notebook is likely on your Desktop under `cs296 -> [netid] -> hw1`
  * You need to ensure the downloaded file is in the `hw1` folder before running the next cell.

### Using Pandas to Read CSV Files

In Data Science, we will use the `pandas` **library** for working with data.  Libraries are collections of functions and other objects that extend what Python can do by itself.  The `pandas` library is one of the most widely used Python libraries for data science.

To use the `pandas` library, we must first **import** it into our notebook.  In future labs, you will find this is one of the first lines we run in every notebook:

In [1]:
import pandas as pd

To use a function inside of the pandas library, the function name will always start with `pd.` (the short name we gave `pandas` in the import) to denote that Python can find it within the `pandas` library.  The first function we will use from the `pandas` library is `read_csv`.
- Because `read_csv` is in the pandas library, which we imported as `pd`, the full function name is `pd.read_csv()`.

### Puzzle 1:

The `pd.read_csv` function's first parameter is the name of the file that it is reading.  Set the variable `df` to be the result of the `pd.read_csv` function:

In [3]:
df = pd.read_csv("UIUC Course Catalog Dataset - Sheet1.csv")
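After reading a CSV, it can help to check that the data loaded the way you expect.  The sketch below is one possible way to do that.  So that it runs without the downloaded file, it reads the first three rows of the dataset from an in-memory string; `csv_text` and `sample_df` are illustrative names, and in the notebook you would inspect `df` instead.

```python
import io
import pandas as pd

# Stand-in for the downloaded file: the header and first three rows of the dataset.
csv_text = """Year,Term,YearTerm,Subject,Number,Title
2019,Spring,2019-sp,AAS,100,Intro Asian American Studies
2019,Spring,2019-sp,AAS,105,Introduction to Arab American Studies
2019,Spring,2019-sp,AAS,120,Intro to Asian Am Pop Culture
"""

# pd.read_csv accepts a file name or a file-like object such as io.StringIO.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)           # (number of rows, number of columns)
print(list(sample_df.columns))   # the column names taken from the header line
print(sample_df.head(2))         # the first two rows of the DataFrame
```

Unlike the plain-text view of a CSV, `pandas` tries to infer a type for each column, so `Year` and `Number` are loaded as integers here.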

### Accessing a Column

The dataset you downloaded has six columns: `Year`, `Term`, `YearTerm`, `Subject`, `Number`, and `Title`.  We can access a column by **indexing into the DataFrame**.  Whenever we **index into** something, we will use matching square brackets: `[` and `]`.  For example, to **index into the column `Subject`**:

In [5]:
df['Subject']

0        AAS
1        AAS
2        AAS
3        AAS
4        AAS
5        AAS
6        AAS
7        AAS
8        AAS
9        AAS
10       AAS
11       AAS
12       AAS
13       AAS
14       AAS
15       AAS
16       AAS
17       AAS
18       AAS
19       AAS
20       AAS
21       AAS
22       AAS
23       AAS
24       AAS
25       AAS
26       AAS
27       AAS
28       AAS
29       AAS
        ... 
8712      VM
8713      VM
8714      VM
8715      VM
8716      VM
8717      VM
8718      VM
8719      VM
8720      VM
8721      VM
8722      VM
8723      VM
8724    WLOF
8725    WLOF
8726    WLOF
8727    WLOF
8728    WLOF
8729    WLOF
8730    WLOF
8731    WLOF
8732    WRIT
8733    WRIT
8734    WRIT
8735    YDSH
8736    YDSH
8737    YDSH
8738    YDSH
8739    YDSH
8740    YDSH
8741    YDSH
Name: Subject, Length: 8742, dtype: object
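The output above is a single column, which `pandas` calls a **Series**.  As a rough sketch of what indexing gives you back, the example below builds a tiny stand-in DataFrame (the name `sample_df` and its four rows are assumptions for illustration, using titles that appear in the dataset) and indexes into it in two ways.

```python
import pandas as pd

# A tiny stand-in for the course catalog with just two columns.
sample_df = pd.DataFrame({
    "Subject": ["AAS", "AAS", "CS", "CS"],
    "Title": [
        "Intro Asian American Studies",
        "Muslims in America",
        "Data Structures",
        "Computer Architecture",
    ],
})

# Indexing with one column name returns a Series (a single column).
subject_column = sample_df["Subject"]
print(type(subject_column).__name__)

# Indexing with a list of column names returns a DataFrame (a table).
subject_table = sample_df[["Subject"]]
print(type(subject_table).__name__)

# A Series has helpers of its own, such as listing each distinct value once.
print(subject_column.unique())
```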

In [8]:
print(df)

      Year    Term YearTerm Subject  Number  \
0     2019  Spring  2019-sp     AAS     100   
1     2019  Spring  2019-sp     AAS     105   
2     2019  Spring  2019-sp     AAS     120   
3     2019  Spring  2019-sp     AAS     199   
4     2019  Spring  2019-sp     AAS     200   
5     2019  Spring  2019-sp     AAS     201   
6     2019  Spring  2019-sp     AAS     211   
7     2019  Spring  2019-sp     AAS     215   
8     2019  Spring  2019-sp     AAS     224   
9     2019  Spring  2019-sp     AAS     246   
10    2019  Spring  2019-sp     AAS     250   
11    2019  Spring  2019-sp     AAS     258   
12    2019  Spring  2019-sp     AAS     260   
13    2019  Spring  2019-sp     AAS     265   
14    2019  Spring  2019-sp     AAS     275   
15    2019  Spring  2019-sp     AAS     281   
16    2019  Spring  2019-sp     AAS     283   
17    2019  Spring  2019-sp     AAS     286   
18    2019  Spring  2019-sp     AAS     287   
19    2019  Spring  2019-sp     AAS     288   
20    2019  S

### Puzzle 2:

Find the column name and list the **name of every course** (ex: 'Data Structures') at Illinois.

- You may need to refer back to the output of the DataFrame to get the correct column name.

In [7]:
df['Title']

0                  Intro Asian American Studies
1         Introduction to Arab American Studies
2                 Intro to Asian Am Pop Culture
3                    Undergraduate Open Seminar
4                          U.S. Race and Empire
5                   US Racial & Ethnic Politics
6                  Asian Americans and the Arts
7                  US Citizenship Comparatively
8                 Asian Am Historical Sociology
9                  Asian American Youth in Film
10                 Asian American Ethnic Groups
11                           Muslims in America
12                 Intro Asian American Theatre
13                          Politics of Hip Hop
14                      The Politics of Fashion
15                 Constructing Race in America
16                       Asian American History
17                    Asian American Literature
18                     Food and Asian Americans
19                   Global Islam and Feminisms
20                             Individua

### Finding a Subset of the Data

Let's find a subset of the data.  To do this, we must **index into the DataFrame** with a **conditional statement**.
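One way to picture this: the conditional is evaluated once per row and produces a column of `True`/`False` values, and indexing with that column keeps only the `True` rows.  The sketch below shows the idea on a tiny stand-in DataFrame; `sample_df`, `mask`, and the four sample rows are illustrative assumptions rather than the full dataset.

```python
import pandas as pd

# A tiny stand-in for the course catalog with just two columns.
sample_df = pd.DataFrame({
    "Subject": ["AAS", "AAS", "CS", "CS"],
    "Title": [
        "Intro Asian American Studies",
        "Muslims in America",
        "Data Structures",
        "Computer Architecture",
    ],
})

# The conditional compares every row's Subject to "AAS",
# giving one True/False value per row.
mask = sample_df["Subject"] == "AAS"
print(mask.tolist())

# Indexing with the True/False column keeps only the rows marked True.
aas_courses = sample_df[mask]
print(aas_courses)

# .loc takes the row condition first and, optionally, a list of columns to keep.
aas_titles = sample_df.loc[mask, ["Title"]]
print(aas_titles)
```

Both forms select the same rows; the `.loc` form additionally narrows the result to the columns you list.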

### Puzzle 3:

Write the conditional within a DataFrame's index to output a DataFrame that contains all of the `CS` courses.

In [16]:
print(df.loc[df['Subject']=='CS', ['Title']])

                                                  Title
2399                               Freshman Orientation
2400                       Intro Computing: Engrg & Sci
2401                           Little Bits to Big Ideas
2402                          Intro Computing: Non-Tech
2403                          Intro to Computer Science
2404                             Software Design Studio
2405                                Discrete Structures
2406                                    Freshman Honors
2407     Undergraduate Open Seminar in Computer Science
2408                      Ethical & Professional Issues
2409                                    Data Structures
2410                              Computer Architecture
2411                   Introduction to Computer Systems
2412                                 System Programming
2413                                 Programming Studio
2414                                      Honors Course
2415                                Numerical Me

## Submit Your Work!

You're almost done -- congratulations!

You need to do four more things:

1. Save your work.  To do this, create a **notebook checkpoint** by using the menu within the notebook to go to **File -> Save and Checkpoint**.

2. Choose `File` and then `Close and Halt` from this notebook.

3. Choose `Quit` on the main notebook webpage.

4. Return to your command line and follow the directions on the honors webpage on how to use git to turn this notebook in to the course!