# Pandas 
Pandas is a Python library that lets analysts "get to know" the data they are working with. The name is derived from "panel data" and is also a play on "Python Data Analysis". Pandas is built on top of `NumPy`, and some `NumPy` operations are incorporated in Pandas, as you will see later on.

Datasets come in many different forms. The most common are:
1. .csv files
2. .xlsx files
3. .txt files

Less common are:
1. .tsv files
2. .zip files

Of these two, `.zip` files are compressed archives, while `.tsv` files are plain text with tab-separated values. Pandas can decompress a `.zip` archive and read the data file inside it.
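To see how these formats behave in practice, the sketch below builds a tiny tab-separated table in memory and reads it back twice: once as plain text and once from inside a `.zip` archive. The data and the file name `foods.zip` are made up for illustration; the only Pandas call used is `pd.read_csv`.

```python
import io
import os
import tempfile
import zipfile

import pandas as pd

# Made-up, tab-separated data (hypothetical column names).
tsv_text = "product\tcalories\napple\t52\nbanana\t89\n"

# Read tab-separated text by telling read_csv what the separator is.
from_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# Write the same text into a .zip archive, then read it straight from the archive.
with tempfile.TemporaryDirectory() as tmp:
    zip_path = os.path.join(tmp, "foods.zip")
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("foods.tsv", tsv_text)
    # The .zip extension lets read_csv infer that it must decompress first.
    from_zip = pd.read_csv(zip_path, sep="\t")

print(from_tsv)
print(from_tsv.equals(from_zip))
```

Note that this works because the archive holds exactly one data file.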

## Essential Pandas Functions
Understanding your data is essential, so Pandas provides several functions for exploring a dataset at a high level. After that, you should know how to select specific data points and filter the dataset. This notebook will guide you through the following:

1. Getting to know your data
2. Filtering Data
3. Sorting Data
4. Basic Statistics

In [1]:
import numpy as np
import pandas as pd

## 1. Getting to Know Your Data

Download the dataset from this link: https://www.kaggle.com/openfoodfacts/world-food-facts/data. Put it in the same folder as this notebook.

### Task 1
As you can see, the file is in the `.tsv` format. Although this differs from `.csv`, it can still be loaded with the `pd.read_csv` function by setting the separator. Note, though, that `read_csv` might not work on other file formats. Be careful! For more information on the `pd.read_csv` function, check out: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

In [None]:
# Load the .tsv file with the `sep` parameter.


### Task 2
Oftentimes, datasets will be much larger than you expect, and displaying them in full might strain your computer. To get a sneak peek of your data, you may run the `.head()` method. By default, this outputs the first 5 rows.
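As a small illustration on made-up data (the column names `row_id` and `squared` are hypothetical), this sketch shows what `.head()` returns and how to ask for a different number of rows:

```python
import pandas as pd

# Toy data: 100 rows, far more than we want to print in full.
toy = pd.DataFrame({"row_id": range(100), "squared": [i ** 2 for i in range(100)]})

first_five = toy.head()     # no argument: the first 5 rows
first_three = toy.head(3)   # pass a number to change how many rows you get

print(first_five)
```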

In [None]:
# See the first 5 entries of the dataset.


### Task 3
How will you know the number of rows and columns of your dataset? In data science lingo, `checking the number of observations` means counting how many rows (data points) your dataset has.

So back to the question: how do you find the number of rows and columns? You do this via the `.shape` attribute, which returns a `(rows, columns)` tuple.
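A minimal sketch on a made-up table (hypothetical columns `a`, `b`, and `c`) shows how `.shape` can be unpacked into a row count and a column count:

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2, 3, 4], "b": [10, 20, 30, 40], "c": list("wxyz")})

# .shape is accessed without parentheses and gives (rows, columns).
n_rows, n_cols = toy.shape
print(n_rows, "observations and", n_cols, "columns")
```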


In [None]:
# Check the shape of your dataset.


How many observations does your data set have? How about columns?

### Task 4
What are these columns? To check the names of the columns, use the `.columns` attribute.

In [None]:
# Print the name of the columns of the dataset.


In [None]:
# Selecting specific columns using iloc


To check the name of the column at a specific position, index `.columns` the same way you would index a list. The same idea is used to get a specific data point.
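The sketch below shows both kinds of lookup on a made-up table; the column names are hypothetical and much shorter than those in the real dataset. Remember that positions start at 0, so the 2nd column sits at position 1.

```python
import pandas as pd

toy = pd.DataFrame({
    "product_name": ["apple", "banana", "cherry"],
    "calories": [52, 89, 50],
    "sugar_g": [10.4, 12.2, 8.0],
})

second_column = toy.columns[1]             # name of the 2nd column
third_product = toy.loc[2, "product_name"] # row label 2, column chosen by name
same_product = toy.iloc[2, 0]              # purely positional: 3rd row, 1st column

print(second_column, third_product, same_product)
```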

In [None]:
# What is the name of the 160th column?


In [None]:
# What is the product name of the 35th observation?


## 2. Filtering and Sorting Data
For this part, we will be pulling the dataset from this url: 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv'

### Task 1

In [None]:
# Import the dataset from the link above. Hint: use pd.read_csv


### Task 2
This data is not in its optimal shape. For one, some columns cannot be used for analysis as they are, specifically the `item_price` column: there is a `$` sign at the beginning of each number, so Pandas reads the values as text. In data science, all numeric data should be stored as numbers. When values carry unnecessary characters like this, we need to find a way to remove them.
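One possible approach, sketched here on a made-up `price` column rather than the real dataset, is to strip the unwanted characters with the `.str` string methods and then convert the result to a numeric type:

```python
import pandas as pd

# Made-up prices stored as text, with a leading "$" and a trailing space.
toy = pd.DataFrame({"price": ["$2.39 ", "$10.98 ", "$1.09 "]})

# Remove surrounding spaces, drop the leading "$", then convert text to float.
toy["price"] = toy["price"].str.strip().str.lstrip("$").astype(float)

print(toy["price"].dtype)
print(toy["price"].max())
```

Once the column is numeric, operations such as `.max()`, `.sum()`, and sorting behave as expected.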

In [None]:
# Clean the `item_price` column


Duplicates often cannot be avoided. They are usually the result of human error or of the collection algorithms used to build the dataset. Either way, duplicates should be deleted.
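As a rough sketch on made-up orders (hypothetical columns `order_id` and `item`), `duplicated()` can be used to count exact repeats before `drop_duplicates()` removes them:

```python
import pandas as pd

toy = pd.DataFrame({
    "order_id": [1, 1, 2, 2],
    "item": ["Chips", "Chips", "Soda", "Burrito"],
})

# Count rows that repeat an earlier row in every column.
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates returns a new DataFrame; toy itself is left unchanged.
deduped = toy.drop_duplicates()

print(n_duplicates)
print(deduped)
```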

In [None]:
# Delete the duplicates.



### Task 3
Sorting data is easy in pandas with the `.sort_values()` method.
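Here is a short sketch on a made-up price list (hypothetical columns `item` and `price`) showing the default ascending order and how to reverse it:

```python
import pandas as pd

toy = pd.DataFrame({"item": ["Soda", "Burrito", "Chips"], "price": [1.09, 8.75, 2.15]})

cheapest_first = toy.sort_values("price")                   # ascending by default
priciest_first = toy.sort_values("price", ascending=False)  # reverse the order

print(priciest_first)
```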

In [None]:
# Sort values from the most expensive to the least expensive


Do you think you can use this method with the `item_name` column?

### Task 4
Filtering is one of the essential skills you need to master when using Pandas. You will be using skills you have learned from basic Python.
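Filtering in Pandas is usually done with a boolean mask: a comparison produces one True/False value per row, and indexing with that mask keeps only the True rows. The sketch below uses made-up data with hypothetical columns `item` and `quantity`.

```python
import pandas as pd

toy = pd.DataFrame({
    "item": ["Soda", "Burrito", "Soda", "Chips"],
    "quantity": [1, 2, 3, 1],
})

is_soda = toy["item"] == "Soda"   # one True/False value per row
sodas = toy[is_soda]              # keep only the rows where the mask is True

# Combine conditions with & (and) or | (or); each condition needs parentheses.
big_sodas = toy[is_soda & (toy["quantity"] > 1)]

print(sodas)
print(big_sodas)
```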

In [None]:
# Select only products with 1 quantity.


In [None]:
# Select all Chicken Bowls


In [None]:
# How many times was a veggie salad bowl ordered?


In [None]:
# How many customers ordered more than one Canned Soda?

## 3. Basic Statistics 
Data analysis won't be complete without numerical analysis, as this is the core of data science. Some `pandas` data analysis functions use `numpy`.

### Task 1 
Load the baby names dataset from the url https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv

In [None]:
# Load the csv from the url.


In [None]:
# See the first 10 entries.


In [None]:
# Delete the `Unnamed: 0` and `Id` columns


### Task 2
Counting how often each value occurs in a column can be done easily with the `.value_counts()` method. As the name suggests, it counts the occurrences of each distinct value. Before you run it, though, you have to select the column.
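For instance, on a made-up `color` column (not part of the baby names dataset), selecting the column first and then calling `.value_counts()` looks like this:

```python
import pandas as pd

toy = pd.DataFrame({"color": ["red", "blue", "red", "green", "red", "blue"]})

# Select the column, then count how often each distinct value appears.
counts = toy["color"].value_counts()
print(counts)
```

The result is sorted from the most frequent value to the least frequent one.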

In [None]:
# How many male and female names are in the dataset?


### Task 3
`groupby` is a very commonly used method in pandas. It allows you to group similar data points and aggregate them with a function of your choosing.
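The grouping-then-aggregating idea can be sketched on made-up sales data (hypothetical columns `city` and `sales`): rows that share a city are grouped together, and an aggregation such as `sum` or `mean` is applied to each group.

```python
import pandas as pd

toy = pd.DataFrame({
    "city": ["Manila", "Cebu", "Manila", "Davao", "Cebu"],
    "sales": [10, 5, 20, 7, 15],
})

# Group rows by city, pick the column to aggregate, then choose the function.
total_by_city = toy.groupby("city")["sales"].sum()
mean_by_city = toy.groupby("city")["sales"].mean()

print(total_by_city)
```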

In [None]:
# Get the count of each name.


### Task 4

In [None]:
# Which name has the most occurrences?


### Task 5

In [None]:
# What is the average count of the names?


### Task 6

In [None]:
# Print all summary statistics.


# Matplotlib
COPYRIGHT: AIM MSDS 2021

## What is Matplotlib?

Matplotlib is a plotting library for Python. There are many libraries in Python, and Matplotlib is one of the most amazing libraries for plotting! This is going to be your best friend. So, let's get started!

## General Concepts about Matplotlib

A Matplotlib figure is made up of different parts.

(1) **FIGURE** The whole figure, which can contain multiple plots (aka charts).

(2) **AXES** A figure can contain many axes. Think of an axes as a single plot. Each axes has a title, an x-label, and a y-label.

(3) **AXIS** Do not confuse AXES with AXIS. The axis objects are the X and Y axes of a plot (for 2D plots).

(4) **ARTIST** Everything that you see on the figure, including text, Line2D objects, etc.

Here is the anatomy of a figure, taken from the [Matplotlib usage guide](https://matplotlib.org/tutorials/introductory/usage.html#sphx-glr-tutorials-introductory-usage-py):

![alt text](https://matplotlib.org/_images/anatomy.png)
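To make the parts listed above concrete, the following sketch builds one Figure containing two Axes with the `plt.subplots` function. The data points and titles are made up for illustration.

```python
import matplotlib.pyplot as plt

# One Figure holding two Axes (plots) side by side.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Each Axes has its own title and axis labels.
axes[0].plot([0, 1, 2], [0, 1, 4])
axes[0].set_title("Left Axes")
axes[0].set_xlabel("x")
axes[0].set_ylabel("y")

axes[1].plot([0, 1, 2], [4, 1, 0])
axes[1].set_title("Right Axes")

# Each Axes owns its Axis objects (the x-axis and the y-axis).
first_x_axis = axes[0].xaxis

plt.show()
```

The lines, titles, and labels drawn here are all Artists.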

## Let's Get Started with a Simple Plot


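**PYPLOT** is a module in Matplotlib that provides simple functions to add plot elements, including lines, text, and images, to the figure. So now, let's import the library.

Note: You may wonder what **%matplotlib inline** is. It is an IPython magic command that makes plots appear directly below the notebook cell that creates them.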


In [None]:
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline 

In [None]:
# Let's start with a basic plot

x = [0, 1, 2, 3, 4, 5]
y = [0, 10, 20, 30, 40, 50]

# We use the plot() and show() functions to create the required plot
plt.plot(x, y)
plt.show()

We use **title( )**, **xlabel( )**, and **ylabel( )** to add a title, an x-axis label, and a y-axis label.

In [None]:
plt.plot([0, 1, 2, 3, 4, 5], [0, 10, 20, 30, 40, 50])
plt.title('First Plot', fontsize=14)
plt.xlabel('X Label', fontsize=12)
plt.ylabel('Y Label', fontsize=12)
plt.show()

We can also specify the size of the figure using the function **figure( )** and passing the width and height (in inches) as a tuple to the argument **figsize**.

In [None]:
plt.figure(figsize=(15, 5))
plt.plot(x, y)
plt.show()

By default, a plot is drawn as a solid blue line. We can change the plot so that each point is represented with a marker instead.

* 'o' - represents a circle
* '^' - represents a triangle

We can type it as an argument **marker='o'**. [More Markers Here](https://matplotlib.org/3.1.1/api/markers_api.html)

In [None]:
plt.figure(figsize=(15, 5))
plt.plot(x, y, marker='o')
plt.show()

**What if I don't want the line and I just want to retain markers?**

There are two ways to do this.

The first is the **linestyle** argument, which you can set to blank ''. The second is a format-string shortcut, covered after this example. The options for **linestyle** include:

* '' - blank
* '--' - dashed
* ':' - dotted

In [None]:
plt.figure(figsize=(15, 5))
plt.plot(x, y, marker='o', linestyle='')
plt.show()

In [None]:
#A shortcut
plt.figure(figsize=(15, 5))
plt.plot(x, y, 'o')
plt.show()

The color can be changed using the **color** argument. [Named Colors](https://matplotlib.org/3.1.1/gallery/color/named_colors.html#sphx-glr-gallery-color-named-colors-py)

In [None]:
plt.figure(figsize=(15, 5))
plt.plot(x, y, 'o', color='red')
plt.show()

In [None]:
#A shortcut to change markers to red: add a base color code before the marker type.
#You can use b: blue, g: green, r: red, c: cyan, m: magenta, k: black, y: yellow
plt.figure(figsize=(15, 5))
plt.plot(x, y, 'ro')
plt.show()

## Let's deal with real data!

We are going to use a car sales dataset.

Download **"Car_sales.csv"** from this [link](https://www.kaggle.com/gagandeep16/car-sales/data#)

In [None]:
#import dataset
import pandas as pd
data = pd.read_csv("Car_sales.csv")

In [None]:
#Check our Data
data.head()

In [None]:
data.shape

In [None]:
#Getting descriptive statistics through the describe() method
data.describe()

## Scatter Plot Introduction

We can now try to plot using a scatter plot to look at Engine Size and Horse Power. What can you say about the plot below?

In [None]:
plt.figure(figsize=(10,5))
plt.scatter(data['Engine_size'], data['Horsepower'])
plt.xlabel('Engine Size')
plt.ylabel('Horsepower')
plt.title('Understanding Engine Size and Horsepower of Cars')
plt.show()

Practice
---

Try to plot the following:

(1) Fuel Efficiency and Horsepower 

(2) Fuel Efficiency and Curb Weight

Are there any insights you can get from the plots?

In [None]:
# Fuel Efficiency and Horsepower

In [None]:
# Fuel Efficiency and Curb Weight

## Bar Charts Introduction


Now let's try to plot bar charts. Suppose we want to understand which manufacturer has the most sales. First, we have to group the dataframe by manufacturer and add up the sales of all of its models.

In [None]:
# Total sales per manufacturer, as a one-column DataFrame indexed by Manufacturer
sales = data.groupby('Manufacturer')['Sales_in_thousands'].sum().to_frame('total_sales')
sales.head()

In [None]:
sales.shape

In [None]:
plt.figure(figsize=(15,5))
plt.bar(sales.index, sales['total_sales'])
plt.title('Understanding Manufacturers with Biggest Sales')
plt.xlabel('Manufacturer')
plt.ylabel('Sales in Thousands ($)')
plt.xticks(rotation=45)
plt.show()

This may not be helpful. What if we want to display only the top 10 manufacturers?

In [None]:
top10_brand = sales.sort_values('total_sales', ascending=False)[:10]
plt.figure(figsize=(10,5))
plt.bar(top10_brand.index, top10_brand['total_sales'])
plt.title('Top 10 Manufacturers with Biggest Sales')
plt.xlabel('Manufacturer')
plt.ylabel('Sales in Thousands ($)')
plt.show()

## Customizing Grids and Ticks

In [None]:
#Adding Grid
plt.figure(figsize=(10,5))
plt.bar(top10_brand.index, top10_brand['total_sales'])
plt.title('Top 10 Manufacturers with Biggest Sales')
plt.xlabel('Manufacturer')
plt.ylabel('Sales in Thousands ($)')
plt.grid() #Calling the grid function turns on the grid
plt.show()

**FROM HELP**

b : bool or None, optional
    Whether to show the grid lines. If any *kwargs* are supplied,
    it is assumed you want the grid on and *b* will be set to True.

    If *b* is *None* and there are no *kwargs*, this toggles the
    visibility of the lines.

which : {'major', 'minor', 'both'}, optional
    The grid lines to apply the changes on.

axis : {'both', 'x', 'y'}, optional
    The axis to apply the changes on.

In [None]:
#Changing Grids
plt.figure(figsize=(10,5))
plt.bar(top10_brand.index, top10_brand['total_sales'])
plt.title('Top 10 Manufacturers with Biggest Sales')
plt.xlabel('Manufacturer')
plt.ylabel('Sales in Thousands ($)')
plt.grid(color='green', linestyle='--', linewidth=0.5, axis='y')
plt.show()

In [None]:
#Changing Labels in XTICKS
import numpy as np

plt.figure(figsize=(10,5))
plt.bar(top10_brand.index, top10_brand['total_sales'])
plt.title('Top 10 Manufacturers with Biggest Sales')
plt.xlabel('Manufacturer')
plt.ylabel('Sales in Thousands ($)')
plt.grid(color='green', linestyle='--', linewidth=0.5)

label=['FD', 'DE', 'TA', 'HA', "CT", "NN", "CC", "JP", "BK", 'MY']
plt.xticks(np.arange(10), labels=label)
plt.show()

In [None]:
#Disable XTICKS by passing an empty list
plt.figure(figsize=(10,5))
plt.bar(top10_brand.index, top10_brand['total_sales'])
plt.title('Top 10 Manufacturers with Biggest Sales')
plt.xlabel('Manufacturer')
plt.ylabel('Sales in Thousands ($)')
plt.grid(color='green', linestyle='--', linewidth=0.5)

plt.xticks([])
plt.show()

In [None]:
top10_brand

In [None]:
#Create a horizontal bar chart with the barh function
plt.figure(figsize=(10,5))
plt.barh(top10_brand.index, top10_brand['total_sales'])
plt.title('Top 10 Manufacturers with Biggest Sales')
#With horizontal bars, sales run along the x-axis and manufacturers along the y-axis
plt.xlabel('Sales in Thousands ($)')
plt.ylabel('Manufacturer')
plt.show()

In [None]:
#Inverting the y-axis so that the biggest seller appears at the top
#Create a horizontal bar chart with the barh function
plt.figure(figsize=(10,5))
plt.barh(top10_brand.index, top10_brand['total_sales'])
plt.title('Top 10 Manufacturers with Biggest Sales')
plt.xlabel('Sales in Thousands ($)')
plt.ylabel('Manufacturer')
plt.gca().invert_yaxis()
plt.show()

## Histogram Introduction

In [None]:
#Because I tend to forget what's in my dataset, let's review it again
data.head()

In [None]:
plt.figure(figsize=(15,5))
plt.hist(data['Manufacturer'])
plt.title('Understanding Manufacturers with Most Number of Cars in the Dataset')
plt.xlabel('Manufacturer')
plt.ylabel('Number of Models')
plt.xticks(rotation=45)
plt.show()

Practice
---
Can you create a histogram based on vehicle type?

In [None]:
plt.figure(figsize=(15,5))
plt.hist(data['Vehicle_type'])
plt.title('Number of Cars per Vehicle Type in the Dataset')
plt.xlabel('Vehicle Type')
plt.ylabel('Number of Models')
plt.xticks(rotation=45)
plt.show()

## Boxplot

In [None]:
plt.boxplot(data['Horsepower'])
plt.show()

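This does not work as expected! Depending on your Matplotlib version, you may get an error or an empty plot. Why? Well, let's check for missing values and look at the unique values of Horsepower.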

In [None]:
data.isnull().sum()

In [None]:
data['Horsepower'].unique()

There are NaN values because we didn't clean the data. Let's clean up and remove the rows with missing values.
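Before dropping anything, it helps to know that `dropna()` without arguments removes every row that has a missing value in any column. The sketch below, on made-up data with hypothetical columns `model`, `horsepower`, and `price`, compares that with the `subset` argument, which only checks the columns you name.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "model": ["A", "B", "C", "D"],
    "horsepower": [140.0, np.nan, 225.0, 150.0],
    "price": [21.5, 28.4, np.nan, 42.0],
})

missing_per_column = toy.isnull().sum()          # number of NaN values in each column
all_complete = toy.dropna()                      # drops rows with a NaN in ANY column
hp_complete = toy.dropna(subset=["horsepower"])  # only requires horsepower to be present

print(missing_per_column)
print(len(all_complete), len(hp_complete))
```

Which variant is appropriate depends on which columns the later analysis actually needs.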

In [None]:
data.shape

In [None]:
data.dropna(inplace=True)
data.shape

In [None]:
plt.boxplot(data['Horsepower'])
plt.show()

In [None]:
#Change the outlier marker to red triangles. Note: notch=False keeps the plain box shape
plt.boxplot(data['Horsepower'], notch=False, sym='r^')
plt.show()

In [None]:
# horizontal boxes
plt.figure()
plt.boxplot(data['Horsepower'], notch=False, sym='rs', vert=False)  # vert=False draws the box horizontally
plt.show()

## The Famous Pie Charts

In [None]:
data.head()

In [None]:
data.Vehicle_type.hist(grid=False)

In [None]:
vtype = data.groupby('Vehicle_type').size()
vtype

In [None]:
#Let's get the keys and values to plot in a pie chart

tot_car = vtype.values.tolist()
vtype_val = vtype.keys().tolist()
print(vtype_val)
print(tot_car)

In [None]:
plt.pie(tot_car, labels=vtype_val)
plt.title('Number of Cars Based on Vehicle Type')
plt.show()

In [None]:
# Let's add percentages using autopct and start the pie chart at a 90-degree angle
plt.pie(tot_car, labels=vtype_val, startangle=90, autopct='%.1f%%')
plt.title('Number of Cars Based on Vehicle Type')
plt.show()

## For Practice!

Explore the `car.txt` dataset, check if there are nulls, and generate visualizations.  

Provide your insights.