# Intro to Pandas

So far you have learned about some useful libraries for working with numeric data. NumPy helps you do math on data efficiently, and Matplotlib helps you visualize that data.

However, in Machine Learning, you will also need to deal with large datasets. These datasets are usually organized into **rows** and **columns**. 

For example, look at this table of iD Tech students: 

![image](https://i.imgur.com/y1QneYd.png)

Each row has a different student's information, and each column has a different category of information. 

Pandas is a library that helps you organize and understand information like this by storing it in dataframes. Dataframes are somewhat like the lists or 2D arrays you have seen in Python, but they offer many more features. Here are some of the differences:

* Dataframes can store different kinds of information in different columns. A NumPy array normally holds a single datatype (all integers or all strings).

* Dataframes allow you to label each column so you know what the information is about. 

* Dataframes are always a table with rows and columns, while arrays can have many different dimensions.

* Dataframes are optimized for cleaning and sorting data, whereas arrays are usually set up for large mathematical operations on the data inside them.
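
The first two differences can be seen in a short sketch. The names `mixed_array` and `students` are made up for this illustration and are not part of the lesson's exercises.

```python
import numpy as np
import pandas as pd

# A NumPy array holds one datatype, so mixing text and numbers
# turns every value into a string.
mixed_array = np.array(["Rei", 13])
print(mixed_array.dtype)

# A dataframe tracks a datatype per labeled column, so "Age" stays numeric.
students = pd.DataFrame({"Name": ["Rei", "Alexis"], "Age": [13, 15]})
print(students.dtypes)
print(students["Age"].mean())
```

Because the `Age` column stays numeric, you can still do math on it even though the `Name` column holds text.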

Now that you know why dataframes are useful for Machine Learning projects, you are going to try some basic operations on them.

1. Run the cell below to see how a dataframe can be created. 

In [1]:
import pandas as pd

my_dataframe = pd.DataFrame(
    {
        "Name": ("Rei", "Alexis", "Riley", "Anna"), 
        "Favorite Color": ("Green", "Yellow", "Purple", "Orange"), 
        "Age": (13, 15, 8, 10)
    }
)

print(my_dataframe)

     Name Favorite Color  Age
0     Rei          Green   13
1  Alexis         Yellow   15
2   Riley         Purple    8
3    Anna         Orange   10


As you can see, making a dataframe is straightforward. You pass in a dictionary whose keys are the column names and whose values are the data for each column.

2. Edit the dataframe in the cell below to add a column for Favorite Candy. 

In [2]:
my_dataframe = pd.DataFrame(
    {
        "Name": ("Rei", "Alexis", "Riley", "Anna"), 
        "Favorite Color": ("Green", "Yellow", "Purple", "Orange"), 
        "Favorite Candy": ("Sour Patch", "Snickers", "Kit Kat", "Butterfinger"), 
        "Age": (13, 15, 8, 10), 
      # Add favorite candy here:

    }
)

print(my_dataframe)

     Name Favorite Color Favorite Candy  Age
0     Rei          Green     Sour Patch   13
1  Alexis         Yellow       Snickers   15
2   Riley         Purple        Kit Kat    8
3    Anna         Orange   Butterfinger   10
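
Editing the dictionary is one way to add a column. As an alternative sketch, a column can also be added to a dataframe that already exists by assigning to a new column name. The shortened dataframe here is only for illustration.

```python
import pandas as pd

my_dataframe = pd.DataFrame(
    {
        "Name": ("Rei", "Alexis", "Riley", "Anna"),
        "Age": (13, 15, 8, 10),
    }
)

# Assigning to a new column name adds that column to the existing dataframe.
# The new values must line up with the rows already there (four in this case).
my_dataframe["Favorite Candy"] = ["Sour Patch", "Snickers", "Kit Kat", "Butterfinger"]

print(my_dataframe)
```

The new column appears at the right edge of the table.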


## Viewing Data

As you can see, you can print out an entire dataframe using the `print` function. However, imagine that you have a dataframe with thousands of rows in it. Printing out that whole dataframe would be slow and hard to read.

Pandas has two methods to help with this: **head** and **tail**.

* `head()` will show you the first five rows of data.

* `tail()` will show you the last five rows of data.
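
Five rows is only the default. Both methods accept a number if you want more or fewer rows. Here is a minimal sketch using a made-up ten-row dataframe called `numbers`:

```python
import pandas as pd

# A hypothetical ten-row dataframe, used only for illustration.
numbers = pd.DataFrame({"value": range(10)})

first_three = numbers.head(3)  # rows 0, 1, 2
last_two = numbers.tail(2)     # rows 8, 9

print(first_three)
print(last_two)
```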

1. In the cell below, there is a test dataset filled with nonsense data. Add a line to call the `head()` function on `test_data` using dot notation. 

2. Run the cell to see the first five rows of data. 

<details><summary>Click for final code</summary>

```
test_data.head()
```
</details>



In [4]:
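# Note: pd.util.testing is deprecated and was removed in pandas 2.0,
# so this line needs an older pandas release.
test_data = pd.util.testing.makeDataFrame()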
# Call head() here
test_data.head()

Unnamed: 0,A,B,C,D
WB2snx3lRL,-0.924482,0.250984,-0.084029,0.236251
7YzfdQ4drf,-0.67136,-0.691751,0.203586,-0.18718
3A4WxVSzId,-0.906549,-0.236316,0.909435,0.497683
zRNCyCzFFe,-1.306793,0.116482,-0.562232,0.364001
54YGb8IPvz,-0.82935,-0.154585,-3.200939,0.019862


3. Call the `tail()` function on `test_data` in the cell below.
4. Run the cell to see the last five rows of `test_data`.

In [5]:
# Try tail() here 
test_data.tail()

Unnamed: 0,A,B,C,D
ze7SRZJyqE,-1.014231,0.165702,-0.901956,1.508195
PYIidAwNrW,-0.061152,-0.12991,1.203504,-0.063258
6zmgX3wD6z,0.113011,1.063493,-0.299944,-0.684222
pBwFzUWwA3,-0.089843,2.086158,1.708267,-0.973638
eCBT19Rws2,-0.073094,1.968683,0.192948,0.285239


## Manipulating dataframes

Now that you know how to view data, you can learn how to drop columns from a dataframe.

1. Run the code below to see how to drop the A column in your test dataset. The `axis=1` argument tells pandas to drop a column rather than a row.

In [6]:
test_data = test_data.drop(['A'], axis=1)

test_data.head()

Unnamed: 0,B,C,D
WB2snx3lRL,0.250984,-0.084029,0.236251
7YzfdQ4drf,-0.691751,0.203586,-0.18718
3A4WxVSzId,-0.236316,0.909435,0.497683
zRNCyCzFFe,0.116482,-0.562232,0.364001
54YGb8IPvz,-0.154585,-3.200939,0.019862


2. In the cell below, remove column 'C'. 
3. Use the `head()` function to see your changes. 

In [7]:
test_data = test_data.drop(['C'], axis=1)

test_data.head()

Unnamed: 0,B,D
WB2snx3lRL,0.250984,0.236251
7YzfdQ4drf,-0.691751,-0.18718
3A4WxVSzId,-0.236316,0.497683
zRNCyCzFFe,0.116482,0.364001
54YGb8IPvz,-0.154585,0.019862
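
The cells above store the result back into `test_data` because `drop()` returns a new dataframe instead of changing the original. The sketch below shows this with a made-up dataframe called `scores`. It also uses the `columns=` keyword, which is another way to write `axis=1`.

```python
import pandas as pd

# A small stand-in for test_data; the values are made up for illustration.
scores = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6]})

# columns=["A"] means the same thing as drop(["A"], axis=1).
without_a = scores.drop(columns=["A"])

# Several columns can be dropped in one call.
only_b = scores.drop(columns=["A", "C"])

print(list(scores.columns))     # the original still has A, B, C
print(list(without_a.columns))
print(list(only_b.columns))
```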


### Importing Data

Another useful thing to do with pandas is to import data into a dataframe so that you can manipulate it. The `read_csv()` function can read a CSV file directly from a URL.

For example, here is a dataset about many different penguins.

1. Add a `penguins.head()` statement to see the first five rows of the penguin data.
2. Run the cell to import and see the data.

In [14]:
penguins = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv')
# Add your head() call here.
penguins.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,MALE
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,FEMALE
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,FEMALE
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,FEMALE
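
`read_csv()` is not limited to URLs. As a rough sketch, the example below reads a few made-up rows from a string of CSV text, which is handy when you want to experiment without an internet connection. The names `csv_text` and `sample` are invented for this illustration.

```python
import io

import pandas as pd

# A few made-up rows in CSV format; the second row is missing its body mass.
csv_text = """species,island,body_mass_g
Adelie,Torgersen,3750
Adelie,Torgersen,
Gentoo,Biscoe,5000
"""

# read_csv accepts a file-like object as well as a URL or a file path.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample)
print(sample.shape)  # (rows, columns)
```

The empty field is read in as a missing value, just like the empty row in the penguin data above.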


### Dropping Useless Rows

When you import data like this, sometimes you get rows with missing values. You can use a method called `dropna()` to remove those rows.

1. In the cell below, use dot notation to call `dropna()` on the `penguins` dataframe to drop the rows where at least one element is missing.
2. Add a `head()` call to see the changes.

In [13]:
# Use dropna here
penguins = penguins.dropna()
# Add code to show the first 5 rows here
penguins.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,MALE
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,FEMALE
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,FEMALE
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,FEMALE
5,Adelie,Torgersen,39.3,20.6,190.0,3650.0,MALE
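
Before dropping rows, it can help to check how much data is missing. The sketch below uses a made-up dataframe called `birds` to count missing values and to show the `subset=` option, which only drops rows that are missing a value in the columns you list.

```python
import numpy as np
import pandas as pd

# Made-up data: one row is missing a mass, another is missing a sex.
birds = pd.DataFrame(
    {
        "species": ["Adelie", "Adelie", "Gentoo"],
        "body_mass_g": [3750.0, np.nan, 5000.0],
        "sex": ["MALE", "FEMALE", None],
    }
)

# Count the missing values in each column before dropping anything.
missing_counts = birds.isna().sum()
print(missing_counts)

# dropna() removes every row with at least one missing value.
complete_rows = birds.dropna()

# subset= limits the check to the listed columns.
has_mass = birds.dropna(subset=["body_mass_g"])

print(len(complete_rows), len(has_mass))
```

The rows that remain keep their original index labels, which is why the penguin output above skips from 2 to 4.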


## Selecting Data

You might also want to create a dataset that contains only some of the data. In this case, you can select different parts of the data.

To do this, put the name of a column in square brackets after the dataframe to show that column. For example, to see the island that each penguin is from, use this code:

```
penguins['island']
```

1. Add code in the cell below to see the species of every penguin.

In [15]:
# Add code here
penguins['species']


0      Adelie
1      Adelie
2      Adelie
3      Adelie
4      Adelie
        ...  
339    Gentoo
340    Gentoo
341    Gentoo
342    Gentoo
343    Gentoo
Name: species, Length: 344, dtype: object
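
A single column like this is called a Series. If you want more than one column, you can pass a list of column names instead, which gives you back a smaller dataframe. Here is a minimal sketch with a made-up dataframe called `birds`:

```python
import pandas as pd

birds = pd.DataFrame(
    {
        "species": ["Adelie", "Gentoo"],
        "island": ["Torgersen", "Biscoe"],
        "body_mass_g": [3750.0, 5000.0],
    }
)

# One column name in brackets returns a Series (a single labeled column).
species = birds["species"]

# A list of column names returns a smaller dataframe.
location = birds[["species", "island"]]

print(type(species).__name__)
print(type(location).__name__)
print(location)
```

Note the double square brackets: the outer pair selects from the dataframe, and the inner pair is the list of names.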

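You can also select a range of rows using a slice written with `:`. The slice stops just before its end value, so `14:23` covers rows 14 through 22. Run the cell below to see them.

In [16]:
# Selecting rows 14 through 22 (the end value, 23, is not included)
penguins[14:23]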

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
14,Adelie,Torgersen,34.6,21.1,198.0,4400.0,MALE
15,Adelie,Torgersen,36.6,17.8,185.0,3700.0,FEMALE
16,Adelie,Torgersen,38.7,19.0,195.0,3450.0,FEMALE
17,Adelie,Torgersen,42.5,20.7,197.0,4500.0,MALE
18,Adelie,Torgersen,34.4,18.4,184.0,3325.0,FEMALE
19,Adelie,Torgersen,46.0,21.5,194.0,4200.0,MALE
20,Adelie,Biscoe,37.8,18.3,174.0,3400.0,FEMALE
21,Adelie,Biscoe,37.7,18.7,180.0,3600.0,MALE
22,Adelie,Biscoe,35.9,19.2,189.0,3800.0,FEMALE
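
Notice that the output ends at row 22. The sketch below, which uses a made-up dataframe called `rows`, compares square-bracket slicing with the `iloc` and `loc` indexers so you can see which ones include the end value.

```python
import pandas as pd

# Thirty made-up rows labeled 0 through 29.
rows = pd.DataFrame({"value": range(30)})

# Square-bracket slicing stops just before the end value, like a Python list.
bracket_slice = rows[14:23]

# iloc slices by position in the same way.
position_slice = rows.iloc[14:23]

# loc slices by index label and includes the end label.
label_slice = rows.loc[14:23]

print(len(bracket_slice), len(position_slice), len(label_slice))
```

If you need the end row included, `loc` is one option, as long as you remember that it works with index labels rather than positions.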


2. Write code to show rows 4 through 14. Remember that the slice stops just before its end value.

In [18]:
# Add your code here:
penguins[4:15]

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,FEMALE
5,Adelie,Torgersen,39.3,20.6,190.0,3650.0,MALE
6,Adelie,Torgersen,38.9,17.8,181.0,3625.0,FEMALE
7,Adelie,Torgersen,39.2,19.6,195.0,4675.0,MALE
8,Adelie,Torgersen,34.1,18.1,193.0,3475.0,
9,Adelie,Torgersen,42.0,20.2,190.0,4250.0,
10,Adelie,Torgersen,37.8,17.1,186.0,3300.0,
11,Adelie,Torgersen,37.8,17.3,180.0,3700.0,
12,Adelie,Torgersen,41.1,17.6,182.0,3200.0,FEMALE
13,Adelie,Torgersen,38.6,21.2,191.0,3800.0,MALE


### Sorting

You can also sort data by its value. This can be useful if you want to find specific rows of data, or if you want the data in a certain order.

1. In the cell below, there is code to sort the data by flipper length, from longest to shortest. Add code to show the first five rows of data.


In [19]:
# Sort penguins by flipper length
penguins = penguins.sort_values(by="flipper_length_mm", ascending=False)

# Add code to print the first few rows:
penguins.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
283,Gentoo,Biscoe,54.3,15.7,231.0,5650.0,MALE
333,Gentoo,Biscoe,51.5,16.3,230.0,5500.0,MALE
335,Gentoo,Biscoe,55.1,16.0,230.0,5850.0,MALE
285,Gentoo,Biscoe,49.8,16.8,230.0,5700.0,MALE
295,Gentoo,Biscoe,48.6,16.0,230.0,5800.0,MALE


2. In the cell below, sort the data by body mass. 
3. Then show the last five rows of data.

In [22]:
# Sort penguins by body mass
penguins = penguins.sort_values(by="body_mass_g", ascending=False)
# Show the last five rows of data
penguins.tail()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
58,Adelie,Biscoe,36.5,16.6,181.0,2850.0,FEMALE
64,Adelie,Biscoe,36.4,17.1,184.0,2850.0,FEMALE
190,Chinstrap,Dream,46.9,16.6,192.0,2700.0,FEMALE
3,Adelie,Torgersen,,,,,
339,Gentoo,Biscoe,,,,,
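
The last two rows in this output have no body mass at all. That is because `sort_values()` places missing values at the end by default, no matter which direction you sort in. The sketch below shows this with a made-up dataframe called `birds`, along with the `na_position` option for moving them to the top.

```python
import numpy as np
import pandas as pd

# Made-up masses; one value is missing.
birds = pd.DataFrame({"body_mass_g": [3750.0, np.nan, 5000.0, 2850.0]})

# Missing values are placed last by default, whichever direction you sort in.
heaviest_first = birds.sort_values(by="body_mass_g", ascending=False)
print(heaviest_first)

# na_position="first" moves them to the top instead.
missing_first = birds.sort_values(by="body_mass_g", na_position="first")
print(missing_first)
```

If you do not want missing rows in the result at all, calling `dropna()` before sorting is another option.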


### Recap 

Now you know how to use some basic and important commands in the pandas library. You can create dataframes, import data, drop columns and rows, select parts of the data, and sort information. This library is powerful and lets you do a lot with datasets.
 