# 2.3 Dataframes
The Pandas library for Python organizes data into two-dimensional structures called dataframes: collections of rows and columns arranged like a table. As in an ordinary data table, columns run left to right along the horizontal axis, and rows run top to bottom along the vertical axis. Dataframes should be a fairly easy concept to grasp if you have worked with spreadsheet programs such as Microsoft Excel.

The dataframe is a class defined by the Pandas library (`DataFrame`), meaning that it comes equipped with methods (functions attached to the object) that operate on its own data. A DataFrame is essentially a dictionary of Series objects, where each Series represents one column and holds one value for each row of data, typically backed by a NumPy array. We'll talk about Series in the next section.

For now, you can think of each key in the dictionary as a column name and each value as an array of data holding that column's entry for every row.
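
The dictionary analogy can be sketched in a few lines. This is only an illustration: the names `names`, `ages`, and `people_df`, and the values in them, are made up for the example.

```python
import pandas as pd

# Two hypothetical columns, each stored as a Series (values are illustrative).
names = pd.Series(["Ada", "Grace", "Alan"])
ages = pd.Series([36, 85, 41])

# The dictionary keys become the column names; each Series becomes a column.
people_df = pd.DataFrame({"name": names, "age": ages})

# Selecting a column by its key gives back a Series, much like a dictionary lookup.
age_column = people_df["age"]
print(type(age_column).__name__)   # Series
print(list(people_df.keys()))      # ['name', 'age']
print(age_column.to_numpy())       # the column's values as a NumPy array
```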

## Creating a dataframe from scratch
By understanding how a dataframe is created, you will be able to more easily understand how to use it.

Imagine that you want to represent an important person in a Python data structure. You might create something like this:

In [26]:
important_person = {
    "firstName": "Christopher",
    "lastName": "Columbus",
    "age": 53,
    "city": "Lisbon",
    "country": "Portugal"
}

This dictionary works well for representing a single person. But what if you want to add somebody else? There are several ways to accomplish this, but one way (the one Pandas uses) is to turn the value of each key into a list:

In [27]:
important_people = {
    "firstName": ["Christopher", "Patrick"],
    "lastName": ["Columbus", "Henry"],
    "age": [53, 63],
    "city": ["Lisbon", "Studley"],
    "country": ["Portugal", "United States"]
}

Using this data structure, each key (e.g. "firstName") acts as a column name, while the list of values that follows it holds that column's entry for each row. Each row keeps its relationship with the data in other columns through its index position in the lists.
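
To make the role of the list position concrete, here is a small plain-Python sketch that rebuilds one person's record from the column lists. The helper `get_row` is hypothetical and written only for this illustration; it is not part of Pandas.

```python
important_people = {
    "firstName": ["Christopher", "Patrick"],
    "lastName": ["Columbus", "Henry"],
    "age": [53, 63],
    "city": ["Lisbon", "Studley"],
    "country": ["Portugal", "United States"],
}

def get_row(columns, position):
    """Collect the value at the same list position from every column."""
    return {name: values[position] for name, values in columns.items()}

second_person = get_row(important_people, 1)
print(second_person["firstName"], second_person["lastName"])  # Patrick Henry
```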

This data structure can now be used to create a Pandas dataframe by passing it to the `DataFrame` constructor after importing Pandas. Note that by convention, Pandas dataframes are typically stored in a variable whose name is or contains "df".

In [28]:
import pandas as pd
important_people_df = pd.DataFrame(important_people)
important_people_df

,firstName,lastName,age,city,country
0,Christopher,Columbus,53,Lisbon,Portugal
1,Patrick,Henry,63,Studley,United States
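
The column-oriented dictionary is not the only input the constructor accepts. As a hedged aside, the sketch below builds the same small table from a list of per-person dictionaries and checks that both routes give an equal dataframe. The variable names are illustrative, and only two of the columns are used to keep it short.

```python
import pandas as pd

# Column-oriented: one list per column (the layout used above).
by_column = {
    "firstName": ["Christopher", "Patrick"],
    "age": [53, 63],
}

# Row-oriented: one dictionary per person.
by_row = [
    {"firstName": "Christopher", "age": 53},
    {"firstName": "Patrick", "age": 63},
]

df_from_columns = pd.DataFrame(by_column)
df_from_rows = pd.DataFrame(by_row)
print(df_from_columns.equals(df_from_rows))  # True
```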


## Importing external data into a dataframe
You will often create Pandas dataframes by importing data from an external file. Pandas comes equipped with functions for importing data from many different file types, including CSV files (`read_csv`) and Excel files (`read_excel`). The example below uses `read_csv`.

In this example, I downloaded one of the most classic datasets, the Titanic survival dataset,
in .csv format. The file contains information about each passenger on the famous ocean liner Titanic,
including their passenger class, sex, port of embarkation, and whether they survived.

In [16]:
df = pd.read_csv("./data/titanic.csv")

The "read_" functions in Pandas require a path to the file, but optional parameters can be added to accommodate your data format, letting you specify the column delimiter, line terminator, and header row, among other things. These functions return the data as a dataframe.
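
As a rough sketch of those optional parameters, the example below reads semicolon-delimited text that has no header row. The text, the column names, and the use of `io.StringIO` in place of a real file path are all assumptions made so the example is self-contained.

```python
import io
import pandas as pd

# Hypothetical file contents: semicolon-delimited, with no header row.
raw_text = "Braund;male;22\nCumings;female;38\nHeikkinen;female;26\n"

passengers = pd.read_csv(
    io.StringIO(raw_text),          # a file path would normally go here
    sep=";",                        # column delimiter
    header=None,                    # the data has no header row
    names=["name", "sex", "age"],   # so supply the column names ourselves
)
print(passengers.shape)  # (3, 3)
```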

In [17]:
df

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


## Initial observations of the data
At first glance, the table shows a lot of information. First, note that all the columns of
the table have names, except for the leftmost one. The leftmost column in a dataframe is called
the index, and it attaches a label to each row (by default, the integers starting at 0). It is
used quite often to access specific rows of the dataframe.
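
A minimal sketch of row access through the index, using a small made-up frame rather than the Titanic data (`df_small` and `by_name` are illustrative names):

```python
import pandas as pd

# A small hypothetical frame; the values are illustrative only.
df_small = pd.DataFrame(
    {"Name": ["Braund", "Cumings", "Heikkinen"], "Age": [22.0, 38.0, 26.0]}
)

print(list(df_small.index))       # [0, 1, 2]
row = df_small.loc[1]             # select the row whose index label is 1
print(row["Name"])                # Cumings

# The index does not have to be numeric; here a column is promoted to the index.
by_name = df_small.set_index("Name")
print(by_name.loc["Heikkinen", "Age"])   # 26.0
```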

Also note that the lower left-hand corner of the notebook output reports that this dataset contains
891 rows and 12 columns. However, not all of them are displayed above. By default, Pandas
truncates the display of large dataframes, but this setting can be overridden.
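
One way to override the display limit is through the Pandas options system. The sketch below uses a made-up 100-row frame and `pd.option_context`, which changes the setting only inside the `with` block; `pd.set_option("display.max_rows", ...)` would change it for the rest of the session instead.

```python
import pandas as pd

# A hypothetical frame that is longer than the default display limit.
long_df = pd.DataFrame({"value": range(100)})

default_limit = pd.get_option("display.max_rows")
print(default_limit)   # 60 unless the setting was changed

# Temporarily raise the limit; the previous setting is restored after the block.
with pd.option_context("display.max_rows", 200):
    full_text = repr(long_df)

truncated_text = repr(long_df)
print(len(full_text.splitlines()) > len(truncated_text.splitlines()))
```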

It’s often useful when exploring data to see the shape of the dataframe, or, in other words, how
many rows and columns it has. We can see it visually above, but sometimes it’s helpful to get it
programmatically as well.

You can find the number of rows and columns by accessing the `shape` attribute of the dataframe. Note that `shape` is an attribute, not a method, so it is written without parentheses.

In [18]:
df.shape

(891, 12)
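
Because `shape` is an ordinary tuple, it can be unpacked into separate variables. A short sketch on a made-up three-row frame:

```python
import pandas as pd

# A small hypothetical frame: 3 rows, 2 columns.
df_small = pd.DataFrame({"Pclass": [3, 1, 3], "Fare": [7.25, 71.2833, 7.925]})

n_rows, n_cols = df_small.shape   # shape is a plain (rows, columns) tuple
print(n_rows, n_cols)             # 3 2
print(len(df_small))              # 3, len() counts rows
print(len(df_small.columns))      # 2
```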

We can also get some useful summary information from our dataframe using the .info() method,
which tells us about each of our columns and their datatypes. “int64” indicates an integer, “object”
typically indicates a string, and “float64” indicates a floating point number (decimal).

Observe that there are 12 columns (excluding the index column) and that not every column has
data in all of its rows. For example, the “Cabin” column only has 204 non-null entries, even though
there are 891 rows.

In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
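
The non-null counts that `.info()` prints can also be retrieved as values. The following is a sketch on a small made-up frame with a few deliberately missing entries; the column names only echo the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with some missing values (NaN / None).
df_small = pd.DataFrame(
    {
        "Age": [22.0, np.nan, 26.0, 35.0],
        "Cabin": [None, "C85", None, "C123"],
        "Fare": [7.25, 71.2833, 7.925, 53.1],
    }
)

non_null = df_small.count()        # non-null values per column, as in .info()
missing = df_small.isna().sum()    # missing values per column
print(non_null["Cabin"], missing["Cabin"])   # 2 2
print(df_small.dtypes)             # the datatype of each column
```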


There’s also a method called “.describe()”, which returns summary statistics for the numerical
columns in the dataframe, including the count, mean, standard deviation, minimum, maximum, and quartiles.

In [20]:
df.describe()

,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292
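
The result of `.describe()` is itself a dataframe, so individual statistics can be looked up by row label and column name. A small sketch with made-up values, which also shows that a text column is left out by default:

```python
import pandas as pd

# Hypothetical frame mixing numeric and text columns.
df_small = pd.DataFrame(
    {
        "Sex": ["male", "female", "female", "male"],
        "Age": [22.0, 38.0, 26.0, 35.0],
        "Fare": [7.25, 71.2833, 7.925, 8.05],
    }
)

summary = df_small.describe()          # numeric columns only by default
print(list(summary.columns))           # ['Age', 'Fare']
print(summary.loc["mean", "Age"])      # 30.25
print(summary.loc["50%", "Age"])       # 30.5, the median
```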


At many points during the data exploration process you will likely need to refer to the original
dataframe. To avoid printing out all of the rows, use the .head() method, which returns the
first five rows by default, or the .tail() method, which returns the last five rows.

In [22]:
df.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [23]:
df.tail()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q
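
Both methods also accept a number of rows to return. A brief sketch on a made-up ten-row frame:

```python
import pandas as pd

# Hypothetical frame with ten rows, ids 1 through 10.
df_small = pd.DataFrame({"PassengerId": range(1, 11)})

first_three = df_small.head(3)
last_two = df_small.tail(2)
print(first_three["PassengerId"].tolist())  # [1, 2, 3]
print(last_two["PassengerId"].tolist())     # [9, 10]
print(list(last_two.index))                 # [8, 9], original index labels are kept
```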
