# Pandas DataFrames

The `pandas.DataFrame` class represents data in a 2D tabular format. It's useful for manipulating data loaded from a spreadsheet or CSV file.

## Loading Data

A CSV file can be imported with the `read_csv()` function; Excel spreadsheets have their own `read_excel()` function. Pandas is generally smart enough to figure out what the columns are named (from the header row) and what types they are (from the values).

When a `DataFrame` is evaluated as the last expression in a cell, the notebook interface will render it as a nicely formatted table. If a `DataFrame` is printed with the `print()` function, the output will be a plaintext representation.

In [2]:
import pandas as pd
import numpy as np

# Read data from a local CSV file into a DataFrame object
df = pd.read_csv('data/weather/nyc.csv')
df

,Unnamed: 0,Date,Max.TemperatureF,Mean.TemperatureF,Min.TemperatureF,Max.Dew.PointF,MeanDew.PointF,Min.DewpointF,Max.Humidity,Mean.Humidity,...,Min.VisibilityMiles,Max.Wind.SpeedMPH,Mean.Wind.SpeedMPH,Max.Gust.SpeedMPH,PrecipitationIn,CloudCover,Events,WindDirDegrees.br...,city,season
0,1,1948-07-01,84,78.0,72,71,65,58,93,65,...,2.0,16,8,,0.00,0.0,Fog,264<br />,New York City (USA),Summer
1,2,1948-07-02,82,72.0,63,62,53,49,76,51,...,10.0,16,10,,0.00,0.0,,315<br />,New York City (USA),Summer
2,3,1948-07-03,78,71.0,64,66,58,53,84,62,...,5.0,14,6,,0.00,0.0,,203<br />,New York City (USA),Summer
3,4,1948-07-04,84,76.0,68,68,63,56,90,67,...,2.0,12,5,,0.00,0.0,Fog,198<br />,New York City (USA),Summer
4,5,1948-07-05,93,82.0,70,74,69,65,93,71,...,3.0,18,8,,0.00,0.0,Fog-Rain-Thunderstorm,218<br />,New York City (USA),Summer
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24555,24623,2015-12-27,63,56.0,48,58,51,37,96,82,...,0.0,23,12,29.0,0.08,8.0,Fog-Rain,283<br />,New York City (USA),Winter
24556,24624,2015-12-28,48,42.0,35,35,25,19,70,57,...,6.0,23,14,37.0,0.05,7.0,Rain-Snow,39<br />,New York City (USA),Winter
24557,24625,2015-12-29,49,42.0,34,46,40,30,93,88,...,1.0,24,12,30.0,0.66,8.0,Rain-Snow,28<br />,New York City (USA),Winter
24558,24626,2015-12-30,51,46.0,41,48,41,37,93,85,...,2.0,22,7,26.0,0.37,8.0,Rain,65<br />,New York City (USA),Winter
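
To try the same loading step without the weather file, here is a minimal sketch that reads CSV text from memory. The three-row sample and the `csv_text` and `sample` names are made up for illustration; only `read_csv()` and the column-name style come from the notebook above.

```python
import io

import pandas as pd

# Hypothetical three-row sample; the columns only mimic the weather file
csv_text = """Date,Max.TemperatureF,Mean.TemperatureF,Events
1948-07-01,84,78.0,Fog
1948-07-02,82,72.0,
1948-07-03,78,71.0,
"""

# read_csv() accepts a path or a file-like object, so StringIO stands in for a file
sample = pd.read_csv(io.StringIO(csv_text))

# Column names come from the header row; a dtype is inferred for each column
print(sample.columns.tolist())
print(sample.dtypes)

# print() produces the plaintext representation rather than the rendered table
print(sample)
```

Empty fields, such as the missing `Events` values here, are read as missing values (`NaN`).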


## DataFrame Metadata

The `DataFrame.head()` method is a good place to start when inspecting the data. It returns a new `DataFrame` with just the first _n_ rows (5 if _n_ is not given). It's useful for getting a feel for the columns and the kind of data you might have in the rest of the `DataFrame` object.

In [3]:
df.head(n=4)

,Unnamed: 0,Date,Max.TemperatureF,Mean.TemperatureF,Min.TemperatureF,Max.Dew.PointF,MeanDew.PointF,Min.DewpointF,Max.Humidity,Mean.Humidity,...,Min.VisibilityMiles,Max.Wind.SpeedMPH,Mean.Wind.SpeedMPH,Max.Gust.SpeedMPH,PrecipitationIn,CloudCover,Events,WindDirDegrees.br...,city,season
0,1,1948-07-01,84,78.0,72,71,65,58,93,65,...,2.0,16,8,,0.0,0.0,Fog,264<br />,New York City (USA),Summer
1,2,1948-07-02,82,72.0,63,62,53,49,76,51,...,10.0,16,10,,0.0,0.0,,315<br />,New York City (USA),Summer
2,3,1948-07-03,78,71.0,64,66,58,53,84,62,...,5.0,14,6,,0.0,0.0,,203<br />,New York City (USA),Summer
3,4,1948-07-04,84,76.0,68,68,63,56,90,67,...,2.0,12,5,,0.0,0.0,Fog,198<br />,New York City (USA),Summer
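
As a small sketch of how the _n_ argument behaves, the following uses a made-up ten-row frame (`demo` is a hypothetical name, not part of the weather data) and compares the default with an explicit value.

```python
import pandas as pd

# Hypothetical frame with ten rows, used only to show the row counts
demo = pd.DataFrame({"day": range(1, 11), "temp_f": range(70, 80)})

first_five = demo.head()     # n defaults to 5
first_two = demo.head(n=2)   # explicit n, as in the cell above

print(len(first_five), len(first_two))

# head() returns a new DataFrame; the original still has all ten rows
print(len(demo))
```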


The `DataFrame.shape` and `DataFrame.size` properties can be used to figure out how many rows, columns, and elements there are in the `DataFrame` object. These are properties of the object, _not_ methods, so they are accessed without parentheses.

`shape` returns a tuple of the dimensions (i.e., the rows and columns).

`size` returns the total number of elements (i.e., rows x columns).

In [4]:
print(df.shape)
print(df.size)

# To show that the size is the product of the dimensions
print(np.prod(df.shape))

(24560, 26)
638560
638560
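
Because `shape` is a plain tuple, it can be unpacked directly. The sketch below uses a small made-up frame (`demo` is a hypothetical name) to show the relationship between `shape`, `size`, and `len()`.

```python
import pandas as pd

# Hypothetical 3 x 2 frame for illustration
demo = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# Unpack the (rows, columns) tuple
n_rows, n_cols = demo.shape
print(n_rows, n_cols)

# size is the product of the two dimensions
print(demo.size == n_rows * n_cols)

# len() of a DataFrame is its number of rows
print(len(demo) == n_rows)
```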


Even more detailed information is available from the `DataFrame.info()` method. It prints a summary with information such as the shape, the column names, and the data type of each column.


The amount of data in the `DataFrame` can [affect what data is displayed](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html), such as whether the non-null counts are shown.
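
The display can also be controlled explicitly. This is a minimal sketch on a made-up frame (`demo`, `buffer`, and `summary` are hypothetical names), assuming a reasonably recent pandas release in which `info()` accepts `show_counts`; older releases used a differently named argument.

```python
import io

import pandas as pd

# Hypothetical frame with one column containing missing values
demo = pd.DataFrame({"temp_f": [84, 82, 78], "events": ["Fog", None, None]})

# Ask for the per-column non-null counts regardless of the frame's size
demo.info(show_counts=True)

# verbose=False drops the per-column table and leaves a short summary
demo.info(verbose=False)

# info() prints its summary and returns None; pass buf= to capture the text
buffer = io.StringIO()
demo.info(buf=buffer)
summary = buffer.getvalue()
```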

There are also basic properties like `DataFrame.index` and `DataFrame.columns` to view row and column labels, respectively.

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24560 entries, 0 to 24559
Data columns (total 26 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Unnamed: 0                 24560 non-null  int64  
 1   Date                       24560 non-null  object 
 2   Max.TemperatureF           24560 non-null  int64  
 3   Mean.TemperatureF          24558 non-null  float64
 4   Min.TemperatureF           24560 non-null  int64  
 5   Max.Dew.PointF             24560 non-null  int64  
 6   MeanDew.PointF             24560 non-null  int64  
 7   Min.DewpointF              24560 non-null  int64  
 8   Max.Humidity               24560 non-null  int64  
 9   Mean.Humidity              24560 non-null  int64  
 10  Min.Humidity               24560 non-null  int64  
 11  Max.Sea.Level.PressureIn   24560 non-null  float64
 12  Mean.Sea.Level.PressureIn  24560 non-null  float64
 13  Min.Sea.Level.PressureIn   24560 non-null  flo

You can use `DataFrame.dtypes` to see the type of each column. Where the values are mixed types or strings, the reported type will be `object`.

Using this information, the method `DataFrame.select_dtypes()` can be used to return a new `DataFrame` that has only the columns whose types you choose to `include` or `exclude`.

In [9]:
df.dtypes

Unnamed: 0                     int64
Date                          object
Max.TemperatureF               int64
Mean.TemperatureF            float64
Min.TemperatureF               int64
Max.Dew.PointF                 int64
MeanDew.PointF                 int64
Min.DewpointF                  int64
Max.Humidity                   int64
Mean.Humidity                  int64
Min.Humidity                   int64
Max.Sea.Level.PressureIn     float64
Mean.Sea.Level.PressureIn    float64
Min.Sea.Level.PressureIn     float64
Max.VisibilityMiles          float64
Mean.VisibilityMiles         float64
Min.VisibilityMiles          float64
Max.Wind.SpeedMPH              int64
Mean.Wind.SpeedMPH             int64
Max.Gust.SpeedMPH            float64
PrecipitationIn               object
CloudCover                   float64
Events                        object
WindDirDegrees.br...          object
city                          object
season                        object
dtype: object

In [13]:
# Since select_dtypes() returns a new DataFrame, you can chain on a call to head()
df.select_dtypes(include=['object']).head(n=5)

,Date,PrecipitationIn,Events,WindDirDegrees.br...,city,season
0,1948-07-01,0.0,Fog,264<br />,New York City (USA),Summer
1,1948-07-02,0.0,,315<br />,New York City (USA),Summer
2,1948-07-03,0.0,,203<br />,New York City (USA),Summer
3,1948-07-04,0.0,Fog,198<br />,New York City (USA),Summer
4,1948-07-05,0.0,Fog-Rain-Thunderstorm,218<br />,New York City (USA),Summer
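
The cell above uses `include`; the following sketch shows `exclude` and the generic `"number"` selector on a small made-up frame. The `demo` frame and its column names are hypothetical and chosen only to produce one integer, one float, and one mixed-type column.

```python
import pandas as pd

# Hypothetical frame: an integer column, a float column, and a mixed-type column
demo = pd.DataFrame({
    "max_temp_f": [84, 82, 78],
    "mean_temp_f": [78.0, 72.0, 71.0],
    "mixed": [1, "two", 3.0],
})

# "number" matches every numeric dtype (integers and floats)
numeric = demo.select_dtypes(include="number")

# exclude keeps everything except the listed dtypes
non_numeric = demo.select_dtypes(exclude="number")

# A list narrows the selection to specific dtypes
floats = demo.select_dtypes(include=["float64"])

print(numeric.columns.tolist())
print(non_numeric.columns.tolist())
print(floats.columns.tolist())
```

The `mixed` column holds an integer, a string, and a float, so it is reported as `object` and lands in the non-numeric selection.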


In [14]:
df.axes

[RangeIndex(start=0, stop=24560, step=1),
 Index(['Unnamed: 0', 'Date', 'Max.TemperatureF', 'Mean.TemperatureF',
        'Min.TemperatureF', 'Max.Dew.PointF', 'MeanDew.PointF', 'Min.DewpointF',
        'Max.Humidity', 'Mean.Humidity', 'Min.Humidity',
        'Max.Sea.Level.PressureIn', 'Mean.Sea.Level.PressureIn',
        'Min.Sea.Level.PressureIn', 'Max.VisibilityMiles',
        'Mean.VisibilityMiles', 'Min.VisibilityMiles', 'Max.Wind.SpeedMPH',
        'Mean.Wind.SpeedMPH', 'Max.Gust.SpeedMPH', 'PrecipitationIn',
        'CloudCover', 'Events', 'WindDirDegrees.br...', 'city', 'season'],
       dtype='object')]
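
The two entries of `DataFrame.axes` are the row labels and the column labels, in that order, which are the same objects exposed by `DataFrame.index` and `DataFrame.columns`. A minimal sketch on a made-up frame (`demo` is a hypothetical name):

```python
import pandas as pd

# Hypothetical 2 x 2 frame for illustration
demo = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# axes is a two-item list: [row labels, column labels]
row_labels, column_labels = demo.axes

# Each entry matches the corresponding property
print(row_labels.equals(demo.index))
print(column_labels.equals(demo.columns))

# Both are Index objects, so they convert to plain lists when needed
print(list(row_labels), list(column_labels))
```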