# Introduction to Pandas

**Pandas** is a package for data manipulation and analysis in Python. The name Pandas is derived from **Pan**el **Da**ta.

Pandas adds two data structures to Python:
* Pandas Series
* Pandas DataFrame

These data structures allow us to work with labelled and relational data in an easy and intuitive manner.

## Why use Pandas?
The recent successes of machine learning algorithms are partly due to the huge amounts of data that we have available to train them.
However, when it comes to data, quantity is not the only thing that matters; the quality of the data is just as important. Large datasets often don't come ready to be fed into your learning algorithms. They frequently contain incorrect values, missing values, outliers, etc.
One important step in machine learning is to look at your data first and make sure it's well suited for your training algorithm by doing some basic data analysis.
This is where Pandas comes in.
Pandas Series and Pandas DataFrames are designed for fast data analysis and manipulation. They are also flexible and easy to use.
Below are some features that make Pandas an excellent package for data analysis.
* Allows the use of labels for rows and columns
* Can calculate rolling statistics on time series data
* Handles NaN values easily
* Can load data of different formats into DataFrames
* Can join and merge different datasets together
* Integrates with NumPy and Matplotlib

## Creating a Pandas Series

A Pandas Series is a one-dimensional, array-like object that can hold many datatypes.

One main difference between a Pandas Series and a NumPy ndarray is that you can assign an index label to each element in the Series (you can name the indices in your Pandas Series).

Another big difference between the two is that a Pandas Series can contain elements of different data types.

Let's create a Pandas Series.

The command `pd.Series(data, index)` creates a Pandas Series, where `data` holds the values and `index` is the list of index labels.

We can create a Pandas Series to store a grocery list.

In [1]:
import pandas as pd

groceries = pd.Series(data=[30, 6, "Yes", "No"], index=["eggs", "apples", "milk", "bread"])

groceries

eggs       30
apples      6
milk      Yes
bread      No
dtype: object

> Pandas Series are displayed with indices in the first column and the data in the second column.

Just like NumPy ndarrays, Pandas Series have attributes that allow us to get information from the series in an easy way.

In [5]:
print("Groceries has shape {}".format(groceries.shape))
print("Groceries has dimension {}".format(groceries.ndim))
print("Groceries has a total of {} elements".format(groceries.size))

Groceries has shape (4,)
Groceries has dimension 1
Groceries has a total of 4 elements


We can also print the index labels and the data separately. This is helpful if you don't happen to know what the index labels of the Pandas Series are.

In [5]:
print("The data in groceries is {}.".format(groceries.values))
print("The index of groceries is {}.".format(groceries.index))

The data in groceries is [30 6 'Yes' 'No'].
The index of groceries is Index(['eggs', 'apples', 'milk', 'bread'], dtype='object').


If you are dealing with a very large Pandas Series and you're not sure whether an index label exists, you can check using the `in` operator.

In [6]:
print("Is bananas an index label in groceries? {}.".format("bananas" in groceries))
print("Is bread an index label in groceries? {}.".format("bread" in groceries))

Is bananas an index label in groceries? False.
Is bread an index label in groceries? True.


## Accessing and deleting elements in a Pandas Series

One great advantage of a Pandas Series is that it allows us to access data in many different ways.

Elements can be accessed using index labels or numerical positions. Both positive and negative positions can be used to access elements from the beginning and from the end of the Series respectively. Recent Pandas versions deprecate positional access with plain square brackets on a labelled Series, so numerical positions are best passed through `.iloc`.

Since there are different ways to access elements, Pandas Series provide a way to remove any ambiguity when accessing them. A Series has two attributes that allow us to state explicitly what we mean to do. These two attributes are:
* `.loc` - stands for location. It's used to explicitly state that we are using a labeled index.
* `.iloc` - stands for integer location. It's used to explicitly state that we are using a numerical index.

Let's see some examples.

In [10]:
print("How many eggs do we need to buy? {}".format(groceries["eggs"]))
print()
print("Do we need milk and bread? \n {}".format(groceries[["milk", "bread"]]))
print()
print("How many eggs and apples do we need to buy? \n {}".format(groceries[["eggs", "apples"]]))
print()
print("How many eggs and apples do we need to buy? \n {}".format(groceries.iloc[[0, 1]]))
print()
print("Do we need bread? {}".format(groceries.iloc[-1]))
print()

How many eggs do we need to buy? 30

Do we need milk and bread? 
 milk     Yes
bread     No
dtype: object

How many eggs and apples do we need to buy? 
 eggs      30
apples     6
dtype: object

How many eggs and apples do we need to buy? 
 eggs      30
apples     6
dtype: object

Do we need bread? No
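
The same lookups can be written with the explicit `.loc` and `.iloc` accessors described above. The following is a minimal sketch that rebuilds the grocery list so it can run on its own; the variable names are chosen only for illustration.

```python
import pandas as pd

groceries = pd.Series(data=[30, 6, "Yes", "No"], index=["eggs", "apples", "milk", "bread"])

# .loc always interprets its argument as an index label
eggs_by_label = groceries.loc["eggs"]
milk_and_bread = groceries.loc[["milk", "bread"]]

# .iloc always interprets its argument as an integer position
eggs_by_position = groceries.iloc[0]
bread_by_position = groceries.iloc[-1]

print(eggs_by_label, eggs_by_position, bread_by_position)
```

Using the accessors makes the intent clear even when the index labels are themselves integers.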



Pandas Series are mutable. We can change elements after the Series has already been created.

In [13]:
print("Original groceries list: \n {}".format(groceries))

groceries["eggs"] = 82
print()
print("Modified groceries list: \n {}".format(groceries))

Original groceries list: 
 eggs       30
apples      6
milk      Yes
bread      No
dtype: object

Modified groceries list: 
 eggs       82
apples      6
milk      Yes
bread      No
dtype: object


We can delete items from Pandas Series using the `drop()` method. By default, this method drops elements **out of place**, meaning that the original Series will not be altered.

In [14]:
print("Original groceries list: \n {}".format(groceries))

print()
print("remove apples (out of place): \n {}".format(groceries.drop("apples")))
print()
print("groceries list after the drop: \n {}".format(groceries))

Original groceries list: 
 eggs       82
apples      6
milk      Yes
bread      No
dtype: object

remove apples (out of place): 
 eggs      82
milk     Yes
bread     No
dtype: object

groceries list after the drop: 
 eggs       82
apples      6
milk      Yes
bread      No
dtype: object


To delete items in place, we need to use the `inplace` keyword and set its value to `True`.

In [15]:
print("Original groceries list: \n {}".format(groceries))

print()
print("remove apples (in place): \n {}".format(groceries.drop("apples", inplace=True)))
print()
print("groceries list after the drop: \n {}".format(groceries))

Original groceries list: 
 eggs       82
apples      6
milk      Yes
bread      No
dtype: object

remove apples (in place): 
 None

groceries list after the drop: 
 eggs      82
milk     Yes
bread     No
dtype: object
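
Note that the second print above shows `None`: with `inplace=True`, `drop()` modifies the Series and returns nothing. A common alternative, sketched here on a fresh copy of the list, is to leave `inplace` at its default and reassign the result.

```python
import pandas as pd

groceries = pd.Series(data=[82, 6, "Yes", "No"], index=["eggs", "apples", "milk", "bread"])

# With inplace=True the Series is changed and the call itself returns None
result = groceries.drop("apples", inplace=True)

# Without inplace, drop() returns a new Series that we assign back to the name
groceries = groceries.drop("milk")

print(result)
print(groceries)
```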


## Arithmetic Operations on Pandas Series

Just like with NumPy ndarrays, we can perform element-wise operations on Pandas Series.

In [16]:
fruits = pd.Series(data=[10, 6, 3], index=["apples", "oranges", "bananas"])

fruits

apples     10
oranges     6
bananas     3
dtype: int64

We can modify the data in fruits by performing basic arithmetic operations. Let's see some examples.

In [17]:
print("Original list of fruits \n {}".format(fruits))
print()
print("fruits + 2: \n {}".format(fruits + 2))
print()
print("fruits - 2: \n {}".format(fruits - 2))
print()
print("fruits * 2: \n {}".format(fruits * 2))
print()
print("fruits / 2: \n {}".format(fruits / 2))

Original list of fruits 
 apples     10
oranges     6
bananas     3
dtype: int64

fruits + 2: 
 apples     12
oranges     8
bananas     5
dtype: int64

fruits - 2: 
 apples     8
oranges    4
bananas    1
dtype: int64

fruits * 2: 
 apples     20
oranges    12
bananas     6
dtype: int64

fruits / 2: 
 apples     5.0
oranges    3.0
bananas    1.5
dtype: float64


Mathematical functions from NumPy, such as `sqrt(x)`, can also be applied to a Pandas Series.

In [18]:
import numpy as np

print("Original list of fruits \n {}".format(fruits))
print()
print("exp(fruits): \n {}".format(np.exp(fruits)))
print()
print("sqrt(fruits): \n {}".format(np.sqrt(fruits)))
print()
print("power(fruits, 2): \n {}".format(np.power(fruits, 2)))

Original list of fruits 
 apples     10
oranges     6
bananas     3
dtype: int64

exp(fruits): 
 apples     22026.465795
oranges      403.428793
bananas       20.085537
dtype: float64

sqrt(fruits): 
 apples     3.162278
oranges    2.449490
bananas    1.732051
dtype: float64

power(fruits, 2): 
 apples     100
oranges     36
bananas      9
dtype: int64


Arithmetic operations can also be applied to only selected items in the fruits list.

In [20]:
print("Original list of fruits \n {}".format(fruits))
print()
print("Amount of bananas + 2 = {}".format(fruits["bananas"] + 2))
print()
print("Amount of apples - 2 = {}".format(fruits.iloc[0] - 2))
print()
print("We double the amount of apples and oranges: \n {}".format(fruits[["apples", "oranges"]] * 2))
print()
print("We halve the amount of apples and oranges: \n {}".format(fruits[["apples", "oranges"]] / 2))
print()

Original list of fruits 
 apples     10
oranges     6
bananas     3
dtype: int64

Amount of bananas + 2 = 5

Amount of apples - 2 = 8

We double the amount of apples and oranges: 
 apples     20
oranges    12
dtype: int64

We halve the amount of apples and oranges: 
 apples     5.0
oranges    3.0
dtype: float64



Arithmetic operations can also be applied to Pandas Series of mixed datatypes, provided the operation is defined for all data types in the Series; otherwise, you get an error.

Let's see what happens when we multiply the groceries list by 2.

In [21]:
groceries * 2

eggs        164
milk     YesYes
bread      NoNo
dtype: object

Pandas doubles each of the elements, including the strings.

If you were to apply an operation that is valid for numbers but not for strings, e.g. `/` (division), you would get an error.

When you have mixed datatypes, make sure that arithmetic operations are valid for all datatypes in your Pandas Series.
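
As a small sketch of the failure case, the snippet below divides a mixed-type Series by 2 and catches the resulting error. It rebuilds the reduced grocery list so that it runs on its own; the `division_failed` flag is only there for illustration.

```python
import pandas as pd

groceries = pd.Series(data=[82, "Yes", "No"], index=["eggs", "milk", "bread"])

try:
    groceries / 2  # "Yes" / 2 is not defined for strings
    division_failed = False
except TypeError as error:
    division_failed = True
    print("Division failed: {}".format(error))
```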

## Creating Pandas DataFrames

Pandas DataFrames are two-dimensional data structures with labelled rows and columns that can hold multiple datatypes.

Pandas DataFrames can be thought of as being similar to spreadsheets.

Pandas DataFrames can be created from dictionaries, from lists of dictionaries, or from a file.

### Creating Pandas DataFrames from a dictionary of Pandas Series

We will create a dictionary of items purchased by two people: Alice and Bob.

In [23]:
import pandas as pd

items = {
    "Bob": pd.Series(data=[245, 25, 55], index=["bike", "pants", "watch"]),
    "Alice": pd.Series(data=[40, 110, 500, 45], index=["book", "glasses", "bike", "pants"]),
}

print(type(items))

<class 'dict'>


Now that we have a dictionary, we can pass it to the `pd.DataFrame()` function.

In [25]:
shopping_carts = pd.DataFrame(items)

shopping_carts

Unnamed: 0,Bob,Alice
bike,245.0,500.0
book,,40.0
glasses,,110.0
pants,25.0,45.0
watch,55.0,


There are several things to notice here:

1. DataFrames are displayed in tabular form (like a spreadsheet).
2. The rows in the DataFrame are built from the union of the index labels of the two Pandas Series in the dictionary.
3. The column labels are taken from the dictionary keys.
4. There are some NaN values. This is Pandas' way of indicating that it doesn't have a value for a particular row and column index.

In the above example, we created a Pandas DataFrame from a dictionary of Pandas Series that have index labels. If we don't provide index labels in the Pandas Series, Pandas will use numerical row indexes when it creates a DataFrame. The numerical row indexes start at 0.

In [26]:
data = {
    "Bob": pd.Series([245, 25, 55]),
    "Alice": pd.Series([40, 110, 500, 45])
}

df = pd.DataFrame(data)

df

Unnamed: 0,Bob,Alice
0,245.0,40
1,25.0,110
2,55.0,500
3,,45


Just like with Pandas Series, we can extract information from DataFrames using attributes.

In [28]:
print("shopping_carts has shape {}".format(shopping_carts.shape))
print("shopping_carts has dimension {}".format(shopping_carts.ndim))
print("shopping_carts has a total of {} elements".format(shopping_carts.size))
print()
print("The data in shopping_carts is: \n {}".format(shopping_carts.values))
print()
print("The row index in shopping_carts is: \n {}".format(shopping_carts.index))
print()
print("The column index in shopping_carts is: \n {}".format(shopping_carts.columns))

shopping_carts has shape (5, 2)
shopping_carts has dimension 2
shopping_carts has a total of 10 elements

The data in shopping_carts is: 
 [[245. 500.]
 [ nan  40.]
 [ nan 110.]
 [ 25.  45.]
 [ 55.  nan]]

The row index in shopping_carts is: 
 Index(['bike', 'book', 'glasses', 'pants', 'watch'], dtype='object')

The column index in shopping_carts is: 
 Index(['Bob', 'Alice'], dtype='object')


When creating `shopping_carts`, we passed in the entire dictionary to `pd.DataFrame()`. However, there may be cases where you're only interested in a subset of the data.

Pandas allows us to select which data we want to put into our DataFrame by means of the keywords `columns` and `index`.

In [31]:
bob_shopping_cart = pd.DataFrame(items, columns=["Bob"])

bob_shopping_cart

Unnamed: 0,Bob
bike,245
pants,25
watch,55


In [32]:
x = pd.DataFrame(items, index=["pants", "book"])

x

Unnamed: 0,Bob,Alice
pants,25.0,45
book,,40


### Creating Pandas DataFrames from a dictionary of lists

Pandas DataFrames can also be created from a dictionary of lists. However, all the lists must be of the same length.

In [34]:
data = {
    "integers": [1, 2, 3],
    "floats": [4.5, 8.2, 9.6]
}

df = pd.DataFrame(data)

df

Unnamed: 0,integers,floats
0,1,4.5
1,2,8.2
2,3,9.6


Notice that we are using numerical row indices since labels were not provided.

Labels can be added by using the `index` keyword.

In [36]:
df = pd.DataFrame(data, index=["label 1", "label 2", "label 3"])

df

Unnamed: 0,integers,floats
label 1,1,4.5
label 2,2,8.2
label 3,3,9.6
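
The equal-length requirement for a dictionary of lists can be checked with a short sketch. The deliberately mismatched data below is made up for illustration; Pandas is expected to raise a `ValueError` when the lists differ in length.

```python
import pandas as pd

mismatched = {
    "integers": [1, 2, 3],
    "floats": [4.5, 8.2],  # one element short on purpose
}

try:
    pd.DataFrame(mismatched)
    raised = False
except ValueError as error:
    raised = True
    print("Could not build the DataFrame: {}".format(error))
```

If the data really is ragged, a dictionary of Pandas Series (as in the previous section) is the more forgiving input, since missing entries become NaN.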


### Creating Pandas DataFrames from a list of dictionaries

In [6]:
items2 = [
    {
        "bikes": 20,
        "pants": 30,
        "watches": 35
    },
    {
        "watches": 10,
        "glasses": 50,
        "bikes": 15,
        "pants": 5
    }
]

store_items = pd.DataFrame(items2)

store_items

Unnamed: 0,bikes,glasses,pants,watches
0,20,,30,35
1,15,50.0,5,10


Again, we can label the rows.

In [7]:
store_items = pd.DataFrame(items2, index=["store 1", "store 2"])

store_items

Unnamed: 0,bikes,glasses,pants,watches
store 1,20,,30,35
store 2,15,50.0,5,10


## Accessing elements in Pandas DataFrames

Elements can be accessed in Pandas DataFrames in many different ways.

In general, we can access rows, columns or individual elements of the DataFrame by using the row and column labels.

In [8]:
print(store_items)

# accessing via columns
print()
print("How many bikes are in each store? \n {}".format(store_items[["bikes"]]))
print()
print("How many bikes and pants are in each store? \n {}".format(store_items[["bikes", "pants"]]))

# accessing via rows
print()
print("What items are in store 1? \n {}".format(store_items.loc[["store 1"]]))
print()
print("How many bikes are in store 2? {}".format(store_items["bikes"]["store 2"]))

         bikes  glasses  pants  watches
store 1     20      NaN     30       35
store 2     15     50.0      5       10

How many bikes are in each store? 
          bikes
store 1     20
store 2     15

How many bikes and pants are in each store? 
          bikes  pants
store 1     20     30
store 2     15      5

What items are in store 1? 
          bikes  glasses  pants  watches
store 1     20      NaN     30       35

How many bikes are in store 2? 15


When accessing individual elements e.g. in the last example, labels should be provided with the column label first i.e. `dataFrame[column][row]`.
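
An alternative worth knowing, shown here as a small sketch on a rebuilt copy of `store_items`, is `.loc[row, column]`. It takes the row label first and performs the lookup in a single step instead of two chained ones.

```python
import pandas as pd

store_items = pd.DataFrame(
    [
        {"bikes": 20, "pants": 30, "watches": 35},
        {"watches": 10, "glasses": 50, "bikes": 15, "pants": 5},
    ],
    index=["store 1", "store 2"],
)

# Chained access: column label first, then row label
chained = store_items["bikes"]["store 2"]

# Single-step access: row label first, then column label
with_loc = store_items.loc["store 2", "bikes"]

print(chained, with_loc)
```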

DataFrames can also be modified by adding rows or columns, e.g. to add shirts to both stores:

In [9]:
store_items["shirts"] = [15, 2]

store_items

Unnamed: 0,bikes,glasses,pants,watches,shirts
store 1,20,,30,35,15
store 2,15,50.0,5,10,2


New columns are added at the end of the DataFrame.

New columns can also be added by using arithmetic operations between two columns, e.g. to create a suits column from the pants and shirts columns:

In [10]:
store_items["suits"] = store_items["pants"] + store_items["shirts"]

store_items

Unnamed: 0,bikes,glasses,pants,watches,shirts,suits
store 1,20,,30,35,15,45
store 2,15,50.0,5,10,2,7


If you opened a new store and needed to add its stock to your DataFrame, you could add a new row to the `store_items` DataFrame.

To add rows to the `store_items` DataFrame, you have to create a new DataFrame and then append it to the original DataFrame. Older Pandas versions provided `DataFrame.append()` for this; it was removed in Pandas 2.0, where `pd.concat()` is the replacement.

In [11]:
new_items = [{
    "bikes": 20,
    "pants": 30,
    "watches": 35,
    "glasses": 4
}]

new_store = pd.DataFrame(new_items, index=["store 3"])

new_store

Unnamed: 0,bikes,glasses,pants,watches
store 3,20,4,30,35


In [12]:
# DataFrame.append() was removed in Pandas 2.0; pd.concat() is the replacement.
# sort=True keeps the alphabetical column order shown in the output below.
store_items = pd.concat([store_items, new_store], sort=True)

store_items

Unnamed: 0,bikes,glasses,pants,shirts,suits,watches
store 1,20,,30,15.0,45.0,35
store 2,15,50.0,5,2.0,7.0,10
store 3,20,4.0,30,,,35


New columns can also be added to the DataFrame using only data from particular rows in particular columns.

For example, we can stock stores 2 and 3 with an item called **new watches**, whose quantity equals each store's current stock of **watches**. Store 1 gets no value, so Pandas fills it with NaN.

In [13]:
store_items['new watches'] = store_items['watches'][1:]

store_items

Unnamed: 0,bikes,glasses,pants,shirts,suits,watches,new watches
store 1,20,,30,15.0,45.0,35,
store 2,15,50.0,5,2.0,7.0,10,10.0
store 3,20,4.0,30,,,35,35.0


It's possible to insert new columns anywhere in a DataFrame using the `dataframe.insert(loc, column, value)` method, where `loc` is the integer position of the new column. Let's add a column named shoes right before suits.

In [14]:
store_items.insert(4, "shoes", [8, 5, 0])

store_items

Unnamed: 0,bikes,glasses,pants,shirts,shoes,suits,watches,new watches
store 1,20,,30,15.0,8,45.0,35,
store 2,15,50.0,5,2.0,5,7.0,10,10.0
store 3,20,4.0,30,,0,,35,35.0


We can also delete rows and columns.

Columns can be deleted using the `pop()` method.

Both columns and rows can be deleted using the `drop()` method which has an `axis` keyword. This keyword is used to determine whether to delete a column or a row.

In [15]:
store_items.pop("new watches")

store_items

Unnamed: 0,bikes,glasses,pants,shirts,shoes,suits,watches
store 1,20,,30,15.0,8,45.0,35
store 2,15,50.0,5,2.0,5,7.0,10
store 3,20,4.0,30,,0,,35


In [16]:
# remove the watches and shoes columns
store_items = store_items.drop(["watches", "shoes"], axis=1)

store_items

Unnamed: 0,bikes,glasses,pants,shirts,suits
store 1,20,,30,15.0,45.0
store 2,15,50.0,5,2.0,7.0
store 3,20,4.0,30,,


In [17]:
# remove store 2 and store 1 rows
store_items = store_items.drop(["store 2", "store 1"], axis=0)

store_items

Unnamed: 0,bikes,glasses,pants,shirts,suits
store 3,20,4.0,30,,
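
Equivalently, `drop()` accepts `columns=` and `index=` keywords, which some readers find clearer than remembering which axis number is which. The sketch below uses a small made-up `stock` DataFrame rather than the notebook's `store_items`.

```python
import pandas as pd

stock = pd.DataFrame(
    {"bikes": [20, 15, 20], "pants": [30, 5, 30], "watches": [35, 10, 35]},
    index=["store 1", "store 2", "store 3"],
)

# Same effect as stock.drop(["watches"], axis=1)
without_watches = stock.drop(columns=["watches"])

# Same effect as stock.drop(["store 1", "store 2"], axis=0)
only_store_3 = stock.drop(index=["store 1", "store 2"])

print(without_watches)
print(only_store_3)
```

As with a Series, `drop()` returns a new DataFrame here and leaves `stock` unchanged.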


To change column labels, use the `rename()` method e.g. to rename "bikes" to "hats"...

In [18]:
store_items = store_items.rename(columns={"bikes": "hats"})

store_items

Unnamed: 0,hats,glasses,pants,shirts,suits
store 3,20,4.0,30,,


`rename()` is also used to change row labels.

In [19]:
store_items = store_items.rename(index={"store 3": "last store"})

store_items

Unnamed: 0,hats,glasses,pants,shirts,suits
last store,20,4.0,30,,


The index can also be changed to be one of the columns in the DataFrame.

In [20]:
store_items = store_items.set_index("pants")

store_items

pants,hats,glasses,shirts,suits
30,20,4.0,,


## Dealing with NaN

As mentioned earlier, before we begin training our learning algorithms with large datasets, we need to clean the data first. This means we need to have a method of detecting and correcting errors in our data.

Any given dataset may have bad data e.g. outliers or incorrect values but the type of bad data that is encountered the most is missing values.

Pandas assigns `NaN` to missing data.

We will learn how to detect and deal with `NaN` values.

In [21]:
items2 = [
    {
        "bikes": 20,
        "pants": 30,
        "watches": 35,
        "shirts": 15,
        "shoes": 8,
        "suits": 45
    },
    {
        "watches": 10,
        "glasses": 50,
        "bikes": 15,
        "pants": 5,
        "shirts": 2,
        "shoes": 5,
        "suits": 7
    },
    {
        "bikes": 20,
        "pants": 30,
        "watches": 35,
        "glasses": 4,
        "shoes": 10
    }
]

# create a dataframe and provide a row index
store_items = pd.DataFrame(items2, index=["store 1", "store 2", "store 3"])

store_items

Unnamed: 0,bikes,glasses,pants,shirts,shoes,suits,watches
store 1,20,,30,15.0,8,45.0,35
store 2,15,50.0,5,2.0,5,7.0,10
store 3,20,4.0,30,,10,,35


To determine the number of `NaN` values, we can combine the `isnull()` and `sum()` methods.

In [22]:
x = store_items.isnull().sum().sum()

print("The number of NaN values in our DataFrame is {}".format(x))

The number of NaN values in our DataFrame is 3


* The `isnull()` method returns a Boolean DataFrame of the same size as `store_items`. It contains `True` for elements that are `NaN` and `False` otherwise.

In [23]:
store_items.isnull()

Unnamed: 0,bikes,glasses,pants,shirts,shoes,suits,watches
store 1,False,True,False,False,False,False,False
store 2,False,False,False,False,False,False,False
store 3,False,False,False,True,False,True,False


* In Pandas, logical `True` values have a numerical value of 1 and `False` has a numerical value of 0.
* To count the total number of `True` values, we call `sum()` twice. The first call returns a Pandas Series with the number of `True` values in each column.

In [24]:
store_items.isnull().sum()

bikes      0
glasses    1
pants      0
shirts     1
shoes      0
suits      1
watches    0
dtype: int64

* The second `sum()` adds up the per-column counts in that Pandas Series.

Instead of counting `NaN`s, we can do the opposite and count the number of non-NaN values. This is done by using the `count()` method as shown below.

In [26]:
print()
print("Number of non-NaN values in the columns of our DataFrame: \n {}".format(store_items.count()))


Number of non-NaN values in the columns of our DataFrame: 
 bikes      3
glasses    2
pants      3
shirts     2
shoes      3
suits      2
watches    3
dtype: int64


Now that we know how to determine if our dataset has `NaN` values, we can either delete or replace them.

To delete them, we can eliminate rows or columns from our dataframe that contain any `NaN` values using the `dropna()` method.

In [30]:
# drop any rows with NaN
store_items.dropna(axis=0)

Unnamed: 0,bikes,glasses,pants,shirts,shoes,suits,watches
store 2,15,50.0,5,2.0,5,7.0,10


In [31]:
# drop any columns with NaN
store_items.dropna(axis=1)

Unnamed: 0,bikes,pants,shoes,watches
store 1,20,30,8,35
store 2,15,5,5,10
store 3,20,30,10,35


The `dropna()` method works out of place: the original DataFrame is not modified. To have it work in place, we can pass the keyword argument `inplace=True` in the call to `dropna()`.

```
store_items.dropna(axis=1, inplace=True)
```

Instead of eliminating `NaN` values, we can replace them with suitable values, e.g. if we chose to replace all `NaN` values with 0, we could use `fillna()`.

In [33]:
store_items.fillna(0)

Unnamed: 0,bikes,glasses,pants,shirts,shoes,suits,watches
store 1,20,0.0,30,15.0,8,45.0,35
store 2,15,50.0,5,2.0,5,7.0,10
store 3,20,4.0,30,0.0,10,0.0,35


`fillna()` can also be used to replace `NaN` values with previous values in the DataFrame.
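
One way to fill from previous values, sketched here on the same `store_items` data, is forward filling with `ffill()`, which copies the last valid value down each column. The companion `bfill()` fills from the next valid value instead. This is a minimal illustration, and the variable names are assumptions.

```python
import pandas as pd

items2 = [
    {"bikes": 20, "pants": 30, "watches": 35, "shirts": 15, "shoes": 8, "suits": 45},
    {"watches": 10, "glasses": 50, "bikes": 15, "pants": 5, "shirts": 2, "shoes": 5, "suits": 7},
    {"bikes": 20, "pants": 30, "watches": 35, "glasses": 4, "shoes": 10},
]
store_items = pd.DataFrame(items2, index=["store 1", "store 2", "store 3"])

# Forward fill: each NaN takes the value from the row above it (axis=0)
forward_filled = store_items.ffill(axis=0)

# Backward fill: each NaN takes the value from the row below it
backward_filled = store_items.bfill(axis=0)

print(forward_filled)
print(backward_filled)
```

A NaN in the first row has no previous value, so forward filling leaves it as NaN; the same holds for the last row when backward filling.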

> Some notes on `fillna()` were skipped here.

...

## Loading Data into Pandas DataFrame

In machine learning, you will most likely use data from many sources to train your learning algorithms. Pandas allows us to load data of different formats into DataFrames. One of the most popular formats is `csv` (comma-separated values).

We can load csv files into DataFrames using the `pd.read_csv()` function.

Let's load Google stock data into a Pandas DataFrame.

In [6]:
Google_stock = pd.read_csv('goog-1.csv')

print("Google_stock is of type: {}".format(type(Google_stock)))
print("Google_stock has shape: {}".format(Google_stock.shape))

Google_stock is of type: <class 'pandas.core.frame.DataFrame'>
Google_stock has shape: (3313, 7)


We can see that the csv file has been loaded into a Pandas DataFrame and that it has 3313 rows and 7 columns. Now let's look at the data.

In [7]:
Google_stock

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,2004-08-19,49.676899,51.693783,47.669952,49.845802,49.845802,44994500
1,2004-08-20,50.178635,54.187561,49.925285,53.805050,53.805050,23005800
2,2004-08-23,55.017166,56.373344,54.172661,54.346527,54.346527,18393200
3,2004-08-24,55.260582,55.439419,51.450363,52.096165,52.096165,15361800
4,2004-08-25,52.140873,53.651051,51.604362,52.657513,52.657513,9257400
5,2004-08-26,52.135906,53.626213,51.991844,53.606342,53.606342,7148200
6,2004-08-27,53.700729,53.959049,52.503513,52.732029,52.732029,6258300
7,2004-08-30,52.299839,52.404160,50.675404,50.675404,50.675404,5235700
8,2004-08-31,50.819469,51.519913,50.749920,50.854240,50.854240,4954800
9,2004-09-01,51.018177,51.152302,49.512966,49.801090,49.801090,9206800


We can see it's a large dataset and Pandas has assigned numerical row indices to the DataFrame. Pandas also used the labels in the csv to assign the column labels.
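
For readers without the `goog-1.csv` file, the loading step can be tried on the first three rows shown above. This sketch passes the text through an in-memory buffer instead of a file; the optional `parse_dates` argument, which is not used in the notes above, asks Pandas to convert the `Date` column to datetimes.

```python
import io

import pandas as pd

csv_text = """Date,Open,High,Low,Close,Adj Close,Volume
2004-08-19,49.676899,51.693783,47.669952,49.845802,49.845802,44994500
2004-08-20,50.178635,54.187561,49.925285,53.805050,53.805050,23005800
2004-08-23,55.017166,56.373344,54.172661,54.346527,54.346527,18393200
"""

# read_csv accepts a file path or any file-like object such as StringIO
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

print(sample.shape)
print(sample.dtypes)
```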

When dealing with large datasets like this one, it's often useful to take a look at the first few rows instead of the whole dataset. We can do this using the `head()` method as shown below.

In [8]:
Google_stock.head()

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,2004-08-19,49.676899,51.693783,47.669952,49.845802,49.845802,44994500
1,2004-08-20,50.178635,54.187561,49.925285,53.80505,53.80505,23005800
2,2004-08-23,55.017166,56.373344,54.172661,54.346527,54.346527,18393200
3,2004-08-24,55.260582,55.439419,51.450363,52.096165,52.096165,15361800
4,2004-08-25,52.140873,53.651051,51.604362,52.657513,52.657513,9257400


We can also have a look at the last few rows using `tail()`.

In [9]:
Google_stock.tail()

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
3308,2017-10-09,980.0,985.424988,976.109985,977.0,977.0,891400
3309,2017-10-10,980.0,981.570007,966.080017,972.599976,972.599976,968400
3310,2017-10-11,973.719971,990.710022,972.25,989.25,989.25,1693300
3311,2017-10-12,987.450012,994.119995,985.0,987.830017,987.830017,1262400
3312,2017-10-13,992.0,997.210022,989.0,989.679993,989.679993,1157700


If we need more control, we can use `head(N)` or `tail(N)` to display the first N and last N rows respectively.
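
A quick sketch of `head(N)` and `tail(N)`, using a made-up ten-row DataFrame so it runs without the stock data:

```python
import pandas as pd

numbers = pd.DataFrame({"value": range(10)})

first_three = numbers.head(3)  # rows 0, 1, 2
last_two = numbers.tail(2)     # rows 8, 9

print(first_three)
print(last_two)
```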

Let's do a quick check to see if we have any NaN values in our dataset.

Instead of using `isnull().sum().sum()`, we can use `isnull().any()` to check whether any of the columns contain NaN values.

In [10]:
Google_stock.isnull().any()

Date         False
Open         False
High         False
Low          False
Close        False
Adj Close    False
Volume       False
dtype: bool

We have no NaN values.

When dealing with large datasets, it's useful to get statistical information about them. The `describe()` method gets descriptive statistics on each column of the DataFrame.

In [11]:
Google_stock.describe()

Unnamed: 0,Open,High,Low,Close,Adj Close,Volume
count,3313.0,3313.0,3313.0,3313.0,3313.0,3313.0
mean,380.186092,383.49374,376.519309,380.072458,380.072458,8038476.0
std,223.81865,224.974534,222.473232,223.85378,223.85378,8399521.0
min,49.274517,50.541279,47.669952,49.681866,49.681866,7900.0
25%,226.556473,228.394516,224.003082,226.40744,226.40744,2584900.0
50%,293.312286,295.433502,289.929291,293.029114,293.029114,5281300.0
75%,536.650024,540.0,532.409973,536.690002,536.690002,10653700.0
max,992.0,997.210022,989.0,989.679993,989.679993,82768100.0


The `describe()` method can also be applied to a single column.

In [13]:
Google_stock["Adj Close"].describe()

count    3313.000000
mean      380.072458
std       223.853780
min        49.681866
25%       226.407440
50%       293.029114
75%       536.690002
max       989.679993
Name: Adj Close, dtype: float64

You can also focus on one statistic using one of the many methods Pandas provides.

In [15]:
print()
print("Maximum values of each column: \n {}".format(Google_stock.max()))
print()
print("Minimum close value: \n {}".format(Google_stock["Close"].min()))
print()
# numeric_only=True skips the text Date column, which has no meaningful mean
print("Average values of each column: \n {}".format(Google_stock.mean(numeric_only=True)))


Maximum values of each column: 
 Date         2017-10-13
Open                992
High             997.21
Low                 989
Close            989.68
Adj Close        989.68
Volume         82768100
dtype: object

Minimum close value: 
 49.681866

Average values of each column: 
 Open         3.801861e+02
High         3.834937e+02
Low          3.765193e+02
Close        3.800725e+02
Adj Close    3.800725e+02
Volume       8.038476e+06
dtype: float64


Another important statistical measure is correlation. It tells us how strongly two sets of values, e.g. the data in two different columns, vary together.

The `corr()` method can be used to get the correlation between different columns.

In [16]:
# numeric_only=True skips the text Date column (needed in Pandas 2.0 and later)
Google_stock.corr(numeric_only=True)

Unnamed: 0,Open,High,Low,Close,Adj Close,Volume
Open,1.0,0.999904,0.999845,0.999745,0.999745,-0.564258
High,0.999904,1.0,0.999834,0.999868,0.999868,-0.562749
Low,0.999845,0.999834,1.0,0.999899,0.999899,-0.567007
Close,0.999745,0.999868,0.999899,1.0,1.0,-0.564967
Adj Close,0.999745,0.999868,0.999899,1.0,1.0,-0.564967
Volume,-0.564258,-0.562749,-0.567007,-0.564967,-0.564967,1.0


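A correlation close to 1 indicates a strong positive linear relationship, a value close to -1 indicates a strong negative one (as one column rises, the other falls), and a value near 0 indicates little or no linear correlation.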
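
To make the scale concrete, here is a small sketch on made-up data: one column that rises exactly with `x` and one that falls exactly as `x` rises. The column names are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "double_x": [2, 4, 6, 8, 10],       # moves exactly with x
    "negative_x": [-1, -2, -3, -4, -5]  # moves exactly against x
})

correlations = df.corr()

print(correlations)
```

In the resulting matrix, `x` against `double_x` comes out at (approximately) 1 and `x` against `negative_x` at (approximately) -1.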

Pandas also has a `groupby()` method. It allows us to group data in different ways.
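
As a rough sketch of what grouping looks like, the snippet below builds a tiny, entirely hypothetical employee table (the `Department` and `Salary` columns and their values are invented here, not taken from any file) and sums the salaries per department.

```python
import pandas as pd

# Hypothetical data, made up purely to illustrate groupby()
employees = pd.DataFrame({
    "Department": ["HR", "RD", "RD", "HR"],
    "Salary": [50000, 70000, 80000, 60000],
})

# Group the rows by department, then add up the Salary column within each group
total_by_department = employees.groupby("Department")["Salary"].sum()

print(total_by_department)
```

Other aggregations such as `mean()` or `count()` can be used in place of `sum()` in the same way.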

> The `fictitious_company.csv` file could not be found, so these notes end here.