# Reading Data with Pandas

## Introduction

Another library that is integral to data science is Pandas. Have you ever looked at a table of data, maybe in Excel, and thought, "I wish I had that in a format I could use in Python!"? Well, you're in luck, because that is exactly the type of problem this library can solve for us. In this lesson, we will introduce the ways in which Pandas is used and why it is such a powerful tool. We'll also show exactly how to get started with Pandas and some of its common operations.

## Objectives
* Importing Pandas
* Creating a DataFrame
* Reading a CSV

## Importing Pandas

The first thing we do with all libraries is import them into our code. And we know that our code usually follows some kind of pattern or convention, right? Well, Pandas is no exception. Just as we alias the NumPy library import `as np`, we alias Pandas `as pd`. Let's check it out.

```python
import pandas as pd
print(pd)
```

```
<module 'pandas' from '/usr/local/lib/python3.6/site-packages/pandas/__init__.py'>
```


Great! We have Pandas imported and aliased, so now we can refer to the Pandas module simply as `pd`.

## Creating a DataFrame

A DataFrame is a new data type for us. To understand what a DataFrame is, it helps to have a visualization of the object. Below, we take some lists (some multi-dimensional) and use them to create a Pandas DataFrame.

Essentially, a DataFrame is an organized, formatted object containing rows and columns, with data held in the cells where the rows and columns meet. You can think of it as roughly the Python equivalent of looking at a table in a CSV or Excel file.

First, let's see the DataFrame without any real values populated. We will create the DataFrame with just columns and rows.

> The syntax for creating a DataFrame is:


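```python
# General form (placeholders, not runnable as written):
# pd.DataFrame(data=DATA, index=INDEX, columns=COLUMNS)
# index and columns are optional; without an index, the rows are
# labeled automatically with integers starting at 0 (i.e. 0, 1, 2, etc.)

# For example, one row of data (a list inside a list) with three named columns:
pd.DataFrame(data=[['Python', 'Pandas', 'Flatiron School']],
             columns=['Coolest Programming Lang', 'Neatest Python Library', 'Best Programming School']
            )
```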

```python
days_of_class = ['day1', 'day2', 'day3', 'day4']
attendance = ['absent', 'present']
empty_students = []
pd.DataFrame(data=empty_students, columns=attendance, index=days_of_class)
```

|      | absent | present |
|------|--------|---------|
| day1 | NaN    | NaN     |
| day2 | NaN    | NaN     |
| day3 | NaN    | NaN     |
| day4 | NaN    | NaN     |


Now, if we add in a list of lists, such as four lists that each contain one student who was absent and one student who was present, all inside another list (i.e. `[ [1,2], [1,2], [1,2], [1,2] ]`), we can provide the data we need for our DataFrame!

Don't think too much about the multidimensional list. We will see those again in the future. But for now, focus on DataFrames! Below, we are creating the list of lists that will be used to show absent and present students for class.

```python
present_stds = ['Anna', 'Billy', 'Meghan', 'George']
absent_stds = ['Hunter', 'Francine', 'Gail', 'Tucker']
# Each inner list is one row: [absent student, present student]
students = [
    [absent_stds[0], present_stds[0]],
    [absent_stds[1], present_stds[1]],
    [absent_stds[2], present_stds[2]],
    [absent_stds[3], present_stds[3]],
]

pd.DataFrame(students, index=days_of_class, columns=attendance)
```

|      | absent   | present |
|------|----------|---------|
| day1 | Hunter   | Anna    |
| day2 | Francine | Billy   |
| day3 | Gail     | Meghan  |
| day4 | Tucker   | George  |
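As an aside, the same list of lists can be built without indexing each position by hand. The sketch below reuses the lists from the cell above and relies on Python's built-in `zip`; it is one possible shortcut, not the only way to write it, and the name `df` is just an illustrative choice.

```python
import pandas as pd

days_of_class = ['day1', 'day2', 'day3', 'day4']
attendance = ['absent', 'present']
present_stds = ['Anna', 'Billy', 'Meghan', 'George']
absent_stds = ['Hunter', 'Francine', 'Gail', 'Tucker']

# zip pairs up the items found at the same position in each list;
# each pair becomes one row: [absent student, present student]
students = [list(pair) for pair in zip(absent_stds, present_stds)]

df = pd.DataFrame(students, index=days_of_class, columns=attendance)
print(df)
```

This only works cleanly when both lists have the same length, since `zip` stops at the end of the shorter one.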


Now, let's see what we meant when we said our DataFrame would **auto index**. A DataFrame always needs an index (a label for each row). Without the days of class as our index, Pandas will populate the index with integers, starting at 0, one for each row of data in the DataFrame. Let's look at the same example as above, but this time without the index.

```python
pd.DataFrame(students, columns=attendance)
```

|   | absent   | present |
|---|----------|---------|
| 0 | Hunter   | Anna    |
| 1 | Francine | Billy   |
| 2 | Gail     | Meghan  |
| 3 | Tucker   | George  |


Now, it looks like we have a *day **0***, but that's alright. The important thing here is to note that without data for your index, Pandas will simply use integers to denote rows. In fact, if we remove the data for the columns, Pandas will also use integers to create column titles. 

```python
pd.DataFrame(students)
```

|   | 0        | 1      |
|---|----------|--------|
| 0 | Hunter   | Anna   |
| 1 | Francine | Billy  |
| 2 | Gail     | Meghan |
| 3 | Tucker   | George |
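To make the default labels easier to see, here is a minimal sketch that inspects them directly and then assigns the labels afterwards. It reuses the student names from above; the variable name `df` is an illustrative choice.

```python
import pandas as pd

students = [['Hunter', 'Anna'], ['Francine', 'Billy'],
            ['Gail', 'Meghan'], ['Tucker', 'George']]

df = pd.DataFrame(students)  # no index or columns supplied

# Both sets of labels default to integers counting up from 0
print(list(df.index))    # row labels
print(list(df.columns))  # column labels

# Labels can also be assigned after the DataFrame has been created
df.columns = ['absent', 'present']
df.index = ['day1', 'day2', 'day3', 'day4']
print(df)
```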


## Reading a CSV File

Okay, so we have had a brief introduction to Pandas and DataFrames. Now, while we *can* create DataFrames like we did above, we will largely be using Pandas to read large files with **C**omma **S**eparated **V**alues and create DataFrames using those values.

This is immensely useful for us, since we will largely be using data that already exists somewhere to create insights and analytics. Taking the already existing information and importing it into a format that we can use to save, manipulate, and query is exactly what we will use Pandas for. Let's take a look!

In our directory already, we have a CSV file named `attendance.csv`. To get the information we want out of this file, we will need to read it first. To do that, we will use a function Pandas provides called `read_csv`, which will do just that. It reads the file and, knowing that it is a CSV, creates a DataFrame formatted the way we want (i.e. with the headers taken from the first line and every subsequent line becoming a row of data underneath those headers).

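```python
csv = pd.read_csv('attendance.csv')
csv  # this is the DataFrame that Pandas has created from the information in our attendance CSV file
```

|   | NAME     | Day 1 | Day 2 | Day 3 | Day 4 | Day 5 | Day 6 | Day 7 | Day 8 | Day 9 | Day 10 |
|---|----------|-------|-------|-------|-------|-------|-------|-------|-------|-------|--------|
| 0 | Jeffrey  | 0     | 1     | 1     | 0     | 0     | 0     | 1     | 1     | 0     | 1      |
| 1 | Michelle | 0     | 1     | 0     | 0     | 1     | 0     | 0     | 0     | 0     | 1      |
| 2 | Carl     | 1     | 1     | 0     | 1     | 0     | 1     | 1     | 0     | 1     | 1      |
| 3 | Chris    | 1     | 0     | 1     | 0     | 0     | 0     | 0     | 1     | 1     | 0      |
| 4 | Kris     | 0     | 0     | 1     | 1     | 0     | 1     | 0     | 0     | 0     | 1      |
| 5 | David    | 1     | 1     | 1     | 0     | 1     | 1     | 1     | 1     | 1     | 1      |
| 6 | Hank     | 1     | 0     | 0     | 1     | 0     | 1     | 0     | 0     | 0     | 0      |
| 7 | Gregory  | 1     | 1     | 0     | 1     | 1     | 0     | 1     | 0     | 1     | 1      |
| 8 | Anna     | 0     | 0     | 1     | 0     | 0     | 1     | 0     | 1     | 1     | 1      |
| 9 | Jordan   | 1     | 0     | 0     | 1     | 0     | 1     | 1     | 1     | 1     | 0      |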


Now we might be wondering, "Great, we can read a CSV and make a DataFrame, but what can I do with this DataFrame?"

Well, typically you will want to take this new information and turn it into some data type we can more easily work with, like a dictionary.

To do this, we can use the `to_dict()` method provided by Pandas, which uses the information in our DataFrame to create a dictionary.

```python
attendance_dict = csv.to_dict()
print(attendance_dict)
```

{'NAME': {0: 'Jeffrey', 1: 'Michelle', 2: 'Carl', 3: 'Chris', 4: 'Kris', 5: 'David', 6: 'Hank', 7: 'Gregory', 8: 'Anna', 9: 'Jordan', 10: 'Kayla', 11: 'Tucker', 12: 'Pattie', 13: 'Morgan', 14: 'Max', 15: 'Rick', 16: 'Stephanie'}, 'Day 1': {0: 0, 1: 0, 2: 1, 3: 1, 4: 0, 5: 1, 6: 1, 7: 1, 8: 0, 9: 1, 10: 0, 11: 1, 12: 1, 13: 1, 14: 0, 15: 1, 16: 0}, 'Day 2': {0: 1, 1: 1, 2: 1, 3: 0, 4: 0, 5: 1, 6: 0, 7: 1, 8: 0, 9: 0, 10: 1, 11: 0, 12: 0, 13: 1, 14: 0, 15: 1, 16: 0}, 'Day 3': {0: 1, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1, 6: 0, 7: 0, 8: 1, 9: 0, 10: 1, 11: 0, 12: 1, 13: 1, 14: 0, 15: 0, 16: 0}, 'Day 4': {0: 0, 1: 0, 2: 1, 3: 0, 4: 1, 5: 0, 6: 1, 7: 1, 8: 0, 9: 1, 10: 0, 11: 0, 12: 0, 13: 1, 14: 0, 15: 0, 16: 1}, 'Day 5': {0: 0, 1: 1, 2: 0, 3: 0, 4: 0, 5: 1, 6: 0, 7: 1, 8: 0, 9: 0, 10: 1, 11: 1, 12: 1, 13: 1, 14: 1, 15: 0, 16: 1}, 'Day 6': {0: 0, 1: 0, 2: 1, 3: 0, 4: 1, 5: 1, 6: 1, 7: 0, 8: 1, 9: 1, 10: 1, 11: 0, 12: 1, 13: 1, 14: 1, 15: 0, 16: 1}, 'Day 7': {0: 1, 1: 0, 2: 1, 3: 0, 4: 0, 5: 1, 6: 0

> **Note:** A 0 stands for 'absent' and a 1 stands for 'present' in the attendance data above.
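The nested output above can be hard to read, so here is a minimal sketch of the dictionary's shape using a made-up three-student sample (not the real file). By default, `to_dict()` nests the data as column name, then row label, then value. Passing `orient='records'` is an alternative that returns one dictionary per row, which some people find easier to loop over.

```python
import pandas as pd

# Hypothetical three-student sample in the same layout as the attendance data
sample = pd.DataFrame(
    [['Jeffrey', 0, 1], ['Michelle', 0, 1], ['Carl', 1, 1]],
    columns=['NAME', 'Day 1', 'Day 2'],
)

# Default layout: {column name: {row label: value}}
by_column = sample.to_dict()
print(by_column['NAME'][2])   # the NAME stored in the row labeled 2

# orient='records' layout: a list with one dictionary per row
by_row = sample.to_dict(orient='records')
print(by_row[0])              # the first row as a dictionary
```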

Now that we have our data in a format that we know how to work with, we can begin to use this information more easily in our programs.

We can analyze the attendance data to answer questions like *which day had the most absences?* or *is there a pattern to any spikes in absences or lack of absences?*
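As a rough sketch of the first question, the code below counts the 0s recorded for each day in a dictionary produced by `to_dict()`. The three-student sample is made up for illustration, and `absences` and `most_absent_day` are hypothetical names; the same loop should work on the full `attendance_dict`, assuming it has a `NAME` column and 0/1 values as described in the note above.

```python
import pandas as pd

# Hypothetical sample in the same layout as the attendance data
# (0 = absent, 1 = present)
sample = pd.DataFrame(
    [['Jeffrey', 0, 1, 1], ['Michelle', 0, 1, 1], ['Carl', 1, 1, 0]],
    columns=['NAME', 'Day 1', 'Day 2', 'Day 3'],
)
attendance_dict = sample.to_dict()

# Count the 0s recorded for each day, skipping the NAME column
absences = {}
for column, values in attendance_dict.items():
    if column == 'NAME':
        continue
    absences[column] = sum(1 for mark in values.values() if mark == 0)

print(absences)

# The day with the largest absence count (the earliest day wins a tie)
most_absent_day = max(absences, key=absences.get)
print(most_absent_day)
```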

## Summary

In this lesson, we introduced Pandas and DataFrames. We looked at common applications for DataFrames and practiced using Pandas to extract data from a CSV file so that we can use it in our Python programs. We will continue to grow our Pandas skills as we learn more about data science and working with data in Python.