# 2.3 Dataframes
The Pandas library for Python organizes data into two-dimensional structures called dataframes: collections of rows and columns arranged in a table-like format. As in an ordinary data table, columns run across the horizontal axis and rows run down the vertical axis. Dataframes should be a fairly easy concept to grasp if you have worked with spreadsheet programs like Microsoft Excel before.

The dataframe is a class defined by the Pandas library (`DataFrame`), meaning that it comes equipped with special methods (functions attached to the object) that operate on its own data. A DataFrame behaves much like a dictionary of Series objects, where each Series represents a column and holds one value for each row of data, typically backed by a NumPy array. We'll talk about Series in the next section.

For now, you can think of the key of each item in the dictionary as the column name and the value of each item as an array of data, with one entry per row.
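
The dictionary analogy can be sketched with a tiny dataframe. The names `toy_df` and `age_column` and the data are hypothetical, chosen only for illustration:

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the dictionary analogy.
toy_df = pd.DataFrame({
    "name": ["Ada", "Grace"],
    "age": [36, 45],
})

# Selecting a column by its name (the "key") returns a Series.
age_column = toy_df["age"]
print(type(age_column).__name__)

# The values of a Series can be viewed as a NumPy array.
print(age_column.to_numpy())

# Like a dictionary, a dataframe exposes its keys, which are the column names.
print(list(toy_df.keys()))
```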

## Creating a dataframe from scratch
By understanding how a dataframe is created, you will be able to more easily understand how to use it.

Imagine that you want to represent an important person in a Python data structure. You might create something like this:

In [None]:
important_person = {
    "firstName": "Christopher",
    "lastName": "Columbus",
    "age": 53,
    "city": "Lisbon",
    "country": "Portugal"
}

This dictionary works great for representing aspects of a single person. However, what if you want to store a second person as well? There are several ways that you could accomplish this, but one way (and a layout that Pandas accepts) is to turn the value of each key into a list:

In [None]:
important_people = {
    "firstName": ["Christopher", "Patrick"],
    "lastName": ["Columbus", "Henry"],
    "age": [53, 63],
    "city": ["Lisbon", "Studley"],
    "country": ["Portugal", "United States"]
}

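Using this data structure, each key (e.g. "firstName") acts as a column name, while the list of values that follows it holds one entry per row. Each row keeps its relationship with the data in other columns through its index position in the lists: the values at position 0 all describe the first person, the values at position 1 all describe the second, and so on.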

This data structure can now be used to create a Pandas dataframe. We can do this by passing the dictionary to the `DataFrame` constructor after importing Pandas. Note that by convention, Pandas dataframes are typically stored in a variable whose name is or contains "df".

In [None]:
import pandas as pd
important_people_df = pd.DataFrame(important_people)
important_people_df
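
To see how index position ties the columns together, the sketch below rebuilds the same dataframe and pulls out one row by position with `.iloc`. The variable name `second_person` is introduced here only for illustration:

```python
import pandas as pd

important_people = {
    "firstName": ["Christopher", "Patrick"],
    "lastName": ["Columbus", "Henry"],
    "age": [53, 63],
    "city": ["Lisbon", "Studley"],
    "country": ["Portugal", "United States"]
}
important_people_df = pd.DataFrame(important_people)

# Position 1 in every list belongs to the same person,
# so selecting row 1 gathers those values into one record.
second_person = important_people_df.iloc[1]
print(second_person["firstName"], second_person["lastName"])
```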

## Importing external data into a dataframe
Often, you will create Pandas dataframes by importing data from an external file. Pandas comes with functions for importing data from many different file types, including CSV files and Excel files. One of these functions, `read_csv`, is shown below.

In this example, I downloaded one of the most classic datasets, the Titanic survival dataset, in .csv format. The file contains information about each passenger on the famous ocean liner Titanic, including their passenger class, sex, port of embarkation, and whether they survived.

In [None]:
df = pd.read_csv("./data/titanic.csv")

The "read_" functions in Pandas take a path to the file, and optional parameters let you accommodate your data format by specifying a column delimiter, a header row, and a line terminator, among other things. These functions return the data as a dataframe.

In [None]:
df
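
As a rough sketch of the optional parameters mentioned above, the example below reads semicolon-delimited text. The data and the names `raw_text` and `sketch_df` are hypothetical, and an in-memory `io.StringIO` buffer stands in for a file on disk so that the example runs on its own:

```python
import io
import pandas as pd

# Hypothetical semicolon-delimited data standing in for a file on disk.
raw_text = "name;age\nAda;36\nGrace;45\n"

# sep sets the column delimiter; header=0 says the first line holds the column names.
sketch_df = pd.read_csv(io.StringIO(raw_text), sep=";", header=0)
print(sketch_df)
```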

Pandas can convert CSV files (`.csv`), Excel files (`.xlsx`), and many other file types to dataframes. If you have an unusual file format that you want to analyze, check the Pandas documentation or search online for how to load it into a dataframe.

## Initial observations of the data
At first glance, our table shows a lot of information. First, note that all the columns of the table have names, except for the leftmost one. The leftmost column in a dataframe is called the index, and it attaches a label to each row (by default, the integers 0, 1, 2, and so on). It is used quite often to access specific rows of the dataframe.
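
A minimal sketch of row access through the index, using hypothetical toy data (`toy_df` is not part of the Titanic example):

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
toy_df = pd.DataFrame({
    "name": ["Ada", "Grace", "Alan"],
    "age": [36, 45, 41],
})

# With no index specified, Pandas labels the rows 0, 1, 2, ...
print(list(toy_df.index))

# .loc selects a row by its index label.
row = toy_df.loc[2]
print(row["name"])
```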

Also note that the lower left-hand corner of the output states that this dataset contains 891 rows and 12 columns. However, not all of them are shown in the output above. Pandas, by default, limits how many rows and columns are displayed, but this setting can be overridden.

It’s often useful when exploring data to see the shape of the dataframe, or, in other words, how many rows and columns it has. We can see it in the output above, but sometimes it’s helpful to get it programmatically as well.

You can find the number of rows and columns by accessing the `shape` attribute of the dataframe, which holds a `(rows, columns)` tuple. Note that `shape` is an attribute, not a method, so it doesn't use parentheses.

In [None]:
df.shape
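
Because `shape` is a tuple, it can be unpacked into separate variables. A small sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data: 3 rows and 2 columns.
toy_df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# shape is a (rows, columns) tuple, so it unpacks into two variables.
n_rows, n_cols = toy_df.shape
print(f"{n_rows} rows, {n_cols} columns")
```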

We can also get some useful summary information from our dataframe using the `.info()` method, which tells us about each of our columns and their datatypes. “int64” indicates an integer, “float64” indicates a floating point number (decimal), and “object” usually indicates a string (it is the general-purpose type for arbitrary Python objects).

Observe that there are 12 columns (not counting the index) and that not every column has data in all of its rows. For example, the “Cabin” column only has 204 non-null entries, even though there are 891 rows.

In [None]:
df.info()
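
The non-null counts can also be computed directly. The sketch below uses hypothetical toy data with a partly empty `cabin` column rather than the Titanic file itself:

```python
import pandas as pd

# Hypothetical toy data with missing values in one column.
toy_df = pd.DataFrame({
    "name": ["Ada", "Grace", "Alan"],
    "cabin": ["C85", None, None],
})

# count() gives the number of non-null entries in each column.
non_null_counts = toy_df.count()
print(non_null_counts)

# isna().sum() gives the number of missing entries in each column.
missing_counts = toy_df.isna().sum()
print(missing_counts)
```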

There’s also a method called `.describe()`, which we can use to get standard summary statistics for the numerical columns in the dataframe, including the count, mean, standard deviation, min, max, and quartiles.

In [None]:
df.describe()

You can see above that when a dataframe contains at least one numerical column, the `.describe()` method summarizes only the numerical columns by default and leaves out categorical data. You can get information about categorical data by passing the parameter `include=[object]` to the method.

In [None]:
df.describe(include=[object])

You can even convert columns to the object type to describe them categorically. For example, we might want to see the most frequent passenger class or age.

In [None]:
df.astype(object).describe()
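
As a sketch of what the categorical summary contains, the example below runs the same call on a hypothetical one-column dataframe and compares it with `value_counts()`, which lists every value with its frequency:

```python
import pandas as pd

# Hypothetical toy data: passenger classes for five passengers.
toy_df = pd.DataFrame({"pclass": [3, 1, 3, 2, 3]})

# Describing object columns reports count, unique, top (most frequent value), and freq.
summary = toy_df.astype(object).describe()
print(summary)

# value_counts() lists each value with how often it appears, most frequent first.
counts = toy_df["pclass"].value_counts()
print(counts)
```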

At many points during the data exploration process you will likely need to refer back to the original dataframe. To avoid printing out all of the rows, use the `.head()` method, which returns the first five rows, or the `.tail()` method, which returns the last five rows.

In [None]:
df.head()

In [None]:
df.tail()
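
Both methods also accept an optional number of rows. A short sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data: ten rows numbered 0 through 9.
toy_df = pd.DataFrame({"n": range(10)})

# Pass a number to control how many rows are returned.
first_three = toy_df.head(3)
last_two = toy_df.tail(2)
print(first_three)
print(last_two)
```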

## Seeing all columns and/or all rows
In an effort to save computing resources and space on your screen, Pandas automatically limits the number of columns and rows that appear when a dataframe is displayed. You can change these settings with the following code:

In [None]:
pd.set_option('display.max_columns', None) # No limit for columns
pd.set_option('display.max_rows', None) # No limit for rows
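
If you only want to lift a limit temporarily, one option is `pd.option_context`, which restores the previous setting when its block ends. `pd.reset_option` returns a setting to its default. The variable names below are introduced only for illustration:

```python
import pandas as pd

# Remember the current row limit so we can compare afterwards.
previous_max_rows = pd.get_option("display.max_rows")

# Inside the with-block there is no row limit.
with pd.option_context("display.max_rows", None):
    inside_max_rows = pd.get_option("display.max_rows")

# After the block, the earlier setting is back in place.
after_max_rows = pd.get_option("display.max_rows")

# A change made with set_option can be undone with reset_option.
pd.set_option("display.max_columns", None)
pd.reset_option("display.max_columns")
```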