In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("pandasIntro.ipynb")

## Lecture Section

We will practice working with different Python libraries, packages, and modules in the next several lectures. It is impossible to teach every function, class, and tool available to you from all that we will cover. Instead, we will focus on the most important, common, and useful aspects. The documentation for the library/package/module and other resources will be shared at the end of each lecture.

In this lecture, we will cover:
* The differences between libraries, packages, and modules.
* `pandas` library, including:
    * Series & DataFrames
    * Reading data into a DataFrame
    * Gaining basic information from a dataset

### Libraries vs. Packages vs. Modules

Before we dig in, let's talk about the differences between a module, a package, and a library.

* **Modules** are `.py` files that contain functions, classes, and other code.
* **Packages** are collections of modules.
* **Libraries** are collections of packages.

In Python, **library** and **package** are often used interchangeably. You can make your own, but we will only be working with packages already available to you. We will also work with a few modules. The packages we will use are kept up to date by their maintainers, and they are all documented online. The modules we will use come built in to your Python installation.
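
As a rough illustration of the distinction, the sketch below imports one module and one package from the standard library. It relies on the fact that packages carry a `__path__` attribute while plain modules do not; the variable names are made up for this example.

```python
import math          # a single module
import json          # a package: a directory that contains several modules
import json.decoder  # one of the modules inside the json package

# Packages have a __path__ attribute; plain modules do not.
# (json_is_package and math_is_package are illustrative names.)
json_is_package = hasattr(json, "__path__")
math_is_package = hasattr(math, "__path__")

print(json_is_package, math_is_package)
```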


### pandas

pandas is an open-source library for data analysis and manipulation in Python. To use it, we need to `import` it. To save some typing, we will import pandas with the alias **pd**.

In [None]:
import pandas as pd

There are two important data structures that are specific to pandas: the Series and the DataFrame.

**Series** are similar to columns in a table. They hold a one-dimensional sequence of values, like a Python list, and they can have a name (the column header). They also have an index that labels each value, like the row numbers in Excel. We can change the index, too.

**DataFrames** are the most common data structure you will see. Each column of a DataFrame is a Series, and together the columns form a table.

In [None]:
a = [1, 2, 3, 4]
pd_series = pd.Series(a)
pd_series

We can change the indexing with the `index` argument, and we can change the column name with `name`.

In [None]:
pd_series = pd.Series(a, index=["a", "b", "c", "d"], name = "column1")
pd_series

If we want to access the value at a specific index, we index into the Series by its label, much as we would look up a key in a dictionary (or a position in a list).

In [None]:
pd_series['c']
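
The lookup above goes by label. As a small sketch of the alternatives, the cell below rebuilds the same Series and compares a label lookup with a position lookup. `.iloc[]` is not covered in this lecture; it is shown here only for contrast, and the variable names are illustrative.

```python
import pandas as pd

pd_series = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"], name="column1")

by_label = pd_series["c"]          # look up by index label, like a dictionary key
by_label_loc = pd_series.loc["c"]  # the same lookup, written explicitly as label-based
by_position = pd_series.iloc[2]    # look up by integer position, like a list index

print(by_label, by_label_loc, by_position)
```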

To create a DataFrame, we first use `{}` to create a dictionary. The keys are the names of the columns, and the list attached to each key holds the data for that column.

To convert to a pandas DataFrame object, we use `pd.DataFrame()` and we pass our dictionary as the argument.

In [None]:
data_ex = {
    "mph": [70, 85.3, 65.5],
    "duration": [50, 40, 45],
    "vehicle": ["car", "truck", "bus"]
}

dataframe_ex = pd.DataFrame(data_ex)

dataframe_ex
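
To see how a DataFrame relates to Series, the sketch below rebuilds the example and pulls out one column. Selecting a column with `dataframe_ex["mph"]` is an addition to the lecture material; `mph_column` is an illustrative name.

```python
import pandas as pd

data_ex = {
    "mph": [70, 85.3, 65.5],
    "duration": [50, 40, 45],
    "vehicle": ["car", "truck", "bus"],
}
dataframe_ex = pd.DataFrame(data_ex)

# Selecting a single column gives back a Series named after that column.
mph_column = dataframe_ex["mph"]

print(type(mph_column).__name__)
print(mph_column.name)
print(list(dataframe_ex.columns))
```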

Every column in a DataFrame needs to have the same number of rows. If a value is missing, use `None`. In a numeric column, it will be converted to `NaN` in the DataFrame.

In [None]:
data_ex2 = {
    "mph": [70, 85.3, 65.5],
    "duration": [50, None, 45],
    "vehicle": ["car", "truck", "bus"]
}

dataframe_ex2 = pd.DataFrame(data_ex2)

dataframe_ex2
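
A minimal sketch of both points, assuming the same kind of dictionary as above: a `None` in a numeric column shows up as a missing value, and columns of unequal length are rejected. `.isna()` is not covered in this lecture, and the variable names are illustrative.

```python
import pandas as pd

# None in a numeric column becomes NaN, and the column is stored as floats.
df_gap = pd.DataFrame({
    "mph": [70, 85.3, 65.5],
    "duration": [50, None, 45],
})
missing_count = int(df_gap["duration"].isna().sum())
duration_dtype = str(df_gap["duration"].dtype)
print(missing_count, duration_dtype)

# Columns with different numbers of rows raise a ValueError.
try:
    pd.DataFrame({"mph": [70, 85.3, 65.5], "duration": [50, 45]})
    length_error = None
except ValueError as err:
    length_error = str(err)
print(length_error)
```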

We can use `.loc[]` to return a row of data as a Series. Use square brackets, not parentheses: `.loc()` returns the indexer object itself (displayed with its memory address) instead of your data.

In [None]:
dataframe_ex.loc[1]

In [None]:
dataframe_ex.loc[[0, 1]] # a list of rows

In [None]:
dataframe_ex.loc[1, "vehicle"] # a specific row and a specific column
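
The three calls above return three different kinds of object. The sketch below rebuilds the example DataFrame and checks each one; the variable names are illustrative.

```python
import pandas as pd

dataframe_ex = pd.DataFrame({
    "mph": [70, 85.3, 65.5],
    "duration": [50, 40, 45],
    "vehicle": ["car", "truck", "bus"],
})

one_row = dataframe_ex.loc[1]               # one label: a Series
two_rows = dataframe_ex.loc[[0, 1]]         # a list of labels: a DataFrame
one_value = dataframe_ex.loc[1, "vehicle"]  # row and column: a single value

print(type(one_row).__name__)
print(type(two_rows).__name__)
print(one_value)
```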

We can change the index of a DataFrame, too.

In [None]:
dataframe_ex3 = pd.DataFrame(data_ex2, index=["a", "b", "c"])
dataframe_ex3

In [None]:
dataframe_ex3.loc['b'] # .loc[] again!
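
Because `.loc[]` works with labels, the old integer labels stop working once the index is changed. The sketch below shows this, along with a label slice, which is not covered in the lecture; unlike list slicing, it includes both endpoints. The variable names are illustrative.

```python
import pandas as pd

dataframe_ex3 = pd.DataFrame(
    {
        "mph": [70, 85.3, 65.5],
        "duration": [50, None, 45],
        "vehicle": ["car", "truck", "bus"],
    },
    index=["a", "b", "c"],
)

# A label slice keeps both the start and the end label.
first_two = dataframe_ex3.loc["a":"b"]
print(list(first_two.index))

# The integer label 0 no longer exists, so this lookup raises a KeyError.
try:
    dataframe_ex3.loc[0]
    old_label_works = True
except KeyError:
    old_label_works = False
print(old_label_works)
```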

Possibly the most useful aspect of pandas is its ability to read a file into a DataFrame object.

In [None]:
df = pd.read_csv('data/disney_princess_popularity_dataset_300_rows.csv')
df

We can use `.head()` and `.tail()` to get the first or last 5 rows of the dataset, respectively. If we pass an integer as an argument, it will return that many rows instead.

In [None]:
df.head()

In [None]:
df.tail(3)
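
If you want to try `read_csv()`, `.head()`, and `.tail()` without a file on disk, the sketch below reads CSV text from memory instead. The data is invented for illustration, and wrapping the text in `io.StringIO` stands in for a file path.

```python
import io

import pandas as pd

# Hypothetical CSV content; it is not part of any course dataset.
csv_text = """name,score
A,1
B,2
C,3
D,4
E,5
F,6
G,7
"""

sample_df = pd.read_csv(io.StringIO(csv_text))

first_five = sample_df.head()   # default: first 5 rows
last_three = sample_df.tail(3)  # last 3 rows

print(first_five)
print(last_three)
```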

Finally, we can gather basic information about our dataset by calling `.info()`.

It shows:
* The number of entries and the range of the index.
* Each column's position, name, number of non-null (not `NaN`) values, and data type.
* The number of columns of each data type.
* The memory usage.

In [None]:
df.info()
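
The facts that `.info()` prints can also be read off individually. The sketch below uses a small invented dataset and three attributes that go beyond this lecture: `.shape`, `.isna()`, and `.dtype`. The variable names are illustrative.

```python
import io

import pandas as pd

# Hypothetical data with one missing price.
csv_text = "item,price\npen,1.5\nbook,\ncup,4.0\n"
sample_df = pd.read_csv(io.StringIO(csv_text))

n_rows, n_columns = sample_df.shape                  # (rows, columns)
null_prices = int(sample_df["price"].isna().sum())   # count of missing values
price_dtype = str(sample_df["price"].dtype)          # data type of one column

sample_df.info()
print(n_rows, n_columns, null_prices, price_dtype)
```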

## Assignment Section

**Question 1.**

For this problem, you are going to create a `pandas` DataFrame called `data_ex`. It needs to have a `sales` (float), `quarter` (int), and `item` (string) column. Each column should have the same number of values, and there should be at least 5 rows.

In [None]:
import pandas as pd
data = ...








In [None]:
grader.check("q1")

**Question 2.**

We will borrow a dataset from Kaggle for this exercise. I have edited it for the purposes of this course.
https://www.kaggle.com/datasets/msjahid/colorado-motor-vehicle-sales-data

For this problem, you will use `pandas` to read the `"data/colorado_motor_vehicle_sales.csv"` file. Find the basic info of the dataframe, then fill in the variables with the appropriate response.

* `rows =` : give the number of rows
* `null_sales =` : give the number of null values in the sales column
* `columns =` : give the number of columns
* `year =` : give the data-type of the column `year` (as a string or as the type object)

In [None]:
df = ...

rows = ...
null_sales = ...
columns = ...
year = ...

In [None]:
grader.check("q2")

**Question 3.** Using the DataFrame you created in Question 2, create a new DataFrame that consists of the old DataFrame's first 5 and last 5 rows. Set `ignore_index=True` if using `pd.concat()`. Do not hard-code the answer.

In [None]:
import pandas as pd
new_df = ...


In [None]:
grader.check("q3")

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()