<a href="https://colab.research.google.com/github/michalis0/Business-Intelligence-and-Analytics/blob/master/2%20-%20Pandas%20and%20Python/walkthroughs/Basic_Pandas_Load_File.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Basic Pandas operations

**Goal**: Our goal here is to learn how to load a dataset into a Pandas DataFrame. The dataset can come in either CSV or JSON format. We will also see how to perform basic data manipulations and very basic data visualizations so that you understand the nature of your data.

## 1. Loading a dataset in CSV format

First you have to import the `pandas` package.

In [None]:
import pandas as pd # press shift+enter to execute it

Now your editor can autocomplete functions that are included in `pandas`. E.g., type `pd.read` and see that it suggests some functions:


In [None]:
# let's load a file

url = "https://media.githubusercontent.com/media/michalis0/Business-Intelligence-and-Analytics/master/data/pandas_tutorial_read.csv"
data = pd.read_csv(url) 
data.head()

Unnamed: 0,01.01.2018 00:01;read;country_7;2458151261;SEO;North America
0,01.01.2018 00:03;read;country_7;2458151262;SEO...
1,01.01.2018 00:04;read;country_7;2458151263;AdW...
2,01.01.2018 00:04;read;country_7;2458151264;AdW...
3,01.01.2018 00:05;read;country_8;2458151265;Red...
4,01.01.2018 00:05;read;country_6;2458151266;Red...


Is the above correct? Most likely not. 

CSV stands for "Comma Separated Values", which is not the case here. The values are separated by ';', so all the data ends up in a single column. The default delimiter is ',' so we need to change it.

In [None]:
data = pd.read_csv(url, delimiter=';') 
data.head()

Unnamed: 0,01.01.2018 00:01,read,country_7,2458151261,SEO,North America
0,01.01.2018 00:03,read,country_7,2458151262,SEO,South America
1,01.01.2018 00:04,read,country_7,2458151263,AdWords,Africa
2,01.01.2018 00:04,read,country_7,2458151264,AdWords,Europe
3,01.01.2018 00:05,read,country_8,2458151265,Reddit,North America
4,01.01.2018 00:05,read,country_6,2458151266,Reddit,North America


This looks better. But something else does not look good now. Our file has no header row, so Pandas used the first data row as column names. Let's assign proper names:

In [None]:
data = pd.read_csv(url, 
                   delimiter=';', 
                   names = ['my_datetime', 'event', 'country', 'user_id', 'source', 'topic']) 
data.head()


Unnamed: 0,my_datetime,event,country,user_id,source,topic
0,01.01.2018 00:01,read,country_7,2458151261,SEO,North America
1,01.01.2018 00:03,read,country_7,2458151262,SEO,South America
2,01.01.2018 00:04,read,country_7,2458151263,AdWords,Africa
3,01.01.2018 00:04,read,country_7,2458151264,AdWords,Europe
4,01.01.2018 00:05,read,country_8,2458151265,Reddit,North America
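
The effect of `delimiter` and `names` can be reproduced without a network connection. The following is a minimal sketch, assuming a tiny in-memory sample: the three rows are copied from the output above, while `raw`, `col_names` and the other variable names are only illustrative.

```python
import io
import pandas as pd

# Three header-less, semicolon-separated rows copied from the output above
raw = (
    "01.01.2018 00:01;read;country_7;2458151261;SEO;North America\n"
    "01.01.2018 00:03;read;country_7;2458151262;SEO;South America\n"
    "01.01.2018 00:04;read;country_7;2458151263;AdWords;Africa\n"
)
col_names = ['my_datetime', 'event', 'country', 'user_id', 'source', 'topic']

# Default settings: comma delimiter, first row used as the header
wrong = pd.read_csv(io.StringIO(raw))

# Right delimiter, but the first data row is still consumed as the header
no_names = pd.read_csv(io.StringIO(raw), delimiter=';')

# Right delimiter and explicit column names: all three rows are kept as data
ok = pd.read_csv(io.StringIO(raw), delimiter=';', names=col_names)

print(wrong.shape, no_names.shape, ok.shape)
```

Comparing the three shapes shows both problems at once: the wrong delimiter collapses everything into one column, and a missing `names` argument silently costs one row of data.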


With the `head` function you see the first few lines. The `.shape` attribute (no parentheses) gives the dimensions of the DataFrame as (rows, columns). You can also see:

- the whole dataset: just type ```data```
- the beginning, as before: ```data.head()```
- the last 5 entries: ```data.tail()```
- a random sample: ```data.sample(5)```
- some descriptive statistics: ```data.describe()```

Try it out below:

In [None]:
data.shape

(1795, 6)

In [None]:
#Use this cell to write your own code

## DataFrame components
There are three components of the DataFrame: the index, columns and data (values). We can extract each of these components into their own variables. Let's do that and then inspect them:

In [None]:
index = data.index
columns = data.columns
values = data.values

In [None]:
index

RangeIndex(start=0, stop=1795, step=1)

In [None]:
columns

Index(['my_datetime', 'event', 'country', 'user_id', 'source', 'topic'], dtype='object')

In [None]:
values

array([['01.01.2018 00:01', 'read', 'country_7', '2458151261', 'SEO',
        'North America'],
       ['01.01.2018 00:03', 'read', 'country_7', '2458151262', 'SEO',
        'South America'],
       ['01.01.2018 00:04', 'read', 'country_7', '2458151263', 'AdWords',
        'Africa'],
       ...,
       ['01.01.2018 23:59', 'read', 'country_6', '2458153053', 'Reddit',
        'Asia'],
       ['01.01.2018 23:59', 'read', 'country_7', '2458153054', 'AdWords',
        'Europe'],
       ['01.01.2018 23:59', 'read', 'country_5', '2458153055', 'Reddit',
        'Asia']], dtype=object)

## Data types of the components

In [None]:
type(index)

pandas.core.indexes.range.RangeIndex

In [None]:
type(columns)

pandas.core.indexes.base.Index

In [None]:
type(values)

numpy.ndarray


The index and the columns are the same type: a pandas **`Index`** object (**`RangeIndex`** is a subclass of **`Index`**), which is a sequence of labels for either the rows or the columns.

The values are a NumPy **`ndarray`**, which stands for n-dimensional array, and is the primary container of data in the NumPy library. Pandas is built directly on top of NumPy.
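
The relationship between these types can be checked directly. This is a small sketch on a hypothetical two-row frame (`small` is not part of the tutorial data); it also uses `to_numpy()`, which returns the same kind of array as `.values`.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame, only for illustration
small = pd.DataFrame({'country': ['country_7', 'country_8'],
                      'user_id': [2458151261, 2458151265]})

# RangeIndex is a subclass of Index, so both checks are True
print(isinstance(small.index, pd.RangeIndex), isinstance(small.index, pd.Index))
print(isinstance(small.columns, pd.Index))

# to_numpy() returns a NumPy ndarray holding the values, one row per record
arr = small.to_numpy()
print(isinstance(arr, np.ndarray), arr.shape)
```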

### Selecting columns

If you want to select two particular columns you can do it like this:

```data[['country', 'user_id']]``` 

or you can take the columns in a different order: 

```data[['user_id', 'country']]```.

The way to remember the syntax is that outer brackets signify that you want to select columns, and the inner brackets are for the list itself.

Try it out.

The above returns a pandas.DataFrame. If you want to return a pandas.Series instead then you can use this syntax:

```data.user_id ```

or 

``` data['user_id'] ```

If you want to keep only those users that came from SEO then you can write:

``` data[data.source == 'SEO'] ```

where the inner statement will create a boolean mask.
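
The selection forms above can be compared side by side. The sketch below assumes a hypothetical three-row `sample` frame that reuses the tutorial's column names.

```python
import pandas as pd

# Hypothetical sample with the same column names as the tutorial data
sample = pd.DataFrame({
    'country': ['country_7', 'country_7', 'country_8'],
    'user_id': [2458151261, 2458151263, 2458151265],
    'source': ['SEO', 'AdWords', 'Reddit'],
})

two_cols = sample[['country', 'user_id']]   # list of names -> DataFrame
one_col = sample['user_id']                 # single name -> Series

mask = sample['source'] == 'SEO'            # boolean Series, one value per row
seo_rows = sample[mask]                     # keeps only rows where mask is True

print(type(two_cols).__name__, type(one_col).__name__)
print(mask.tolist())
print(seo_rows)
```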

### Chaining

You can combine multiple selection methods as follows:

``` data.head()[['country', 'user_id']] ```

**CAUTION**: Keep in mind that a chained selection often works on a *copy* of the original DataFrame. If you use chaining to change data, the original DataFrame may therefore remain unchanged. To modify values in place, use a single `.loc[rows, columns]` assignment instead.
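
The difference can be seen on a small example. This sketch assumes a hypothetical `sample` frame; the warning that pandas normally prints for the chained assignment is silenced only to keep the output short.

```python
import warnings
import pandas as pd

# Hypothetical sample with the same column names as the tutorial data
sample = pd.DataFrame({
    'source': ['SEO', 'AdWords', 'SEO'],
    'topic': ['Africa', 'Europe', 'Asia'],
})
mask = sample['source'] == 'SEO'

# Chained assignment: sample[mask] is a temporary copy, so the write is lost
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    sample[mask]['topic'] = 'changed'
after_chained = sample['topic'].tolist()

# A single .loc call selects rows and column at once and writes to the original
sample.loc[mask, 'topic'] = 'changed'
after_loc = sample['topic'].tolist()

print(after_chained)
print(after_loc)
```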

#### Exercise 1:
Now it's your turn to solve an exercise and deepen your knowledge.

Select the `user_id`, `country` and `topic` columns for the users who are from `country_2`, and show only the first 10 results.

In [None]:
# enter your solution here.

In [None]:
# possible solution (uncomment the code)
# data.loc[data['country'] == 'country_2', ['user_id', 'country', 'topic']].head(10)

### Column Types

Remember that each column/attribute of our data may have a different type. Have a look at the following table to understand the different data types in Pandas and Python.

| Pandas dtype | Python type | NumPy type | Usage |
| :--- | :--- | :--- | :--- |
| object | str or mixed | string_, unicode_, mixed types | Text or mixed numeric and non-numeric values |
| int64 | int | int_, int8, int16, int32, int64, uint8, uint16, uint32, uint64 | Integer numbers, e.g. [1,2,3,...] |
| float64 | float | float_, float16, float32, float64 | Floating point numbers (they contain decimal points) |
| bool | bool | bool_ | True/False values |
| datetime64 | NA | datetime64[ns] | Date and time values |
| timedelta64[ns] | NA | NA | Differences between two datetimes |
| category | NA | NA | Finite list of text values |

Although Pandas will correctly infer the type of data most of the time, it is important to check our columns and convert them if needed. Otherwise our results might be wrong.

The `.dtypes` attribute allows us to see the type of each column. Run the following cell to see how it works:

In [None]:
data.dtypes

my_datetime    object
event          object
country        object
user_id        object
source         object
topic          object
dtype: object

We can see that, although `object` is a reasonable type for the text columns, we can still improve our DataFrame. Let's convert the `my_datetime` column to the `datetime64` type, so that Pandas knows this column contains date and time values.

For this, the [pd.to_datetime()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html) function comes in handy.


In [None]:
# First, let's convert the column type
data["my_datetime"] = pd.to_datetime(data["my_datetime"]) 

#Now let's have a look at the dtypes once again
data.dtypes

my_datetime    datetime64[ns]
event                  object
country                object
user_id                object
source                 object
topic                  object
dtype: object
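
When the layout of the date strings is known, passing it explicitly avoids any guessing about which number is the day and which is the month. The sketch below assumes the layout is day.month.year followed by hour:minute; this is an assumption, since every date shown above is 01.01.2018 and the second value here is hypothetical.

```python
import pandas as pd

# Strings in the same layout as the my_datetime column; the second value is hypothetical
raw_dates = pd.Series(['01.01.2018 00:01', '02.01.2018 13:30'])

# Assumption: the layout is day.month.year hour:minute
parsed = pd.to_datetime(raw_dates, format='%d.%m.%Y %H:%M')

print(parsed.dt.day.tolist(), parsed.dt.month.tolist(), parsed.dt.hour.tolist())
```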

The user_id column seems to contain integers. Pandas did not correctly infer it since some of the rows in this column are corrupted (they contain other values than integers). 

However, we can still use the `pd.to_numeric` function to convert the column to the correct type. The `errors='coerce'` argument handles the corrupted entries by setting every such cell in the `user_id` column to NaN (Not a Number).

Other possible values for `errors` are:
 * `errors='raise'` (the default): raises an error if it encounters an invalid entry.
 * `errors='ignore'`: returns the original data (the data we passed as an argument) if it encounters an error. No conversion takes place. Recent pandas versions deprecate this option.

In [None]:
#Let's convert the column type 
data["user_id"] = pd.to_numeric(data["user_id"], errors='coerce')

#Now let's have a look at the dtypes once again
data.dtypes


my_datetime    datetime64[ns]
event                  object
country                object
user_id               float64
source                 object
topic                  object
dtype: object
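
The column ends up as `float64` rather than an integer type because NaN is a floating point value. The sketch below illustrates this on three hypothetical ids (the last one stands in for a corrupted cell) and shows the nullable `Int64` dtype as one possible way to keep whole numbers alongside missing values.

```python
import pandas as pd

# Hypothetical ids; the last entry stands in for a corrupted cell
ids = pd.Series(['2458151261', '2458151262', 'not_an_id'])

# errors='coerce' turns anything unparsable into NaN
coerced = pd.to_numeric(ids, errors='coerce')

# errors='raise' (the default) stops at the first bad entry
try:
    pd.to_numeric(ids, errors='raise')
    raised = False
except ValueError:
    raised = True

# The nullable 'Int64' dtype keeps whole numbers and marks gaps as <NA>
as_int = coerced.astype('Int64')

print(coerced.dtype, raised, as_int.dtype)
```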

Since the number of possible countries is limited, we can convert that column to the categorical type with the `.astype()` function: 

In [None]:
#Change dtype
data['country'] = data['country'].astype('category')

#Now let's have a look at the dtypes once again
data.dtypes



my_datetime    datetime64[ns]
event                  object
country              category
user_id               float64
source                 object
topic                  object
dtype: object
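
To see what the categorical type stores, the `.cat` accessor exposes the distinct categories and the integer code assigned to each row. A short sketch on a hypothetical Series of country labels:

```python
import pandas as pd

# Hypothetical country labels in the same style as the tutorial data
countries = pd.Series(['country_7', 'country_8', 'country_7', 'country_6'])
as_cat = countries.astype('category')

# Each distinct value is stored once; rows hold small integer codes
print(as_cat.cat.categories.tolist())
print(as_cat.cat.codes.tolist())
```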

### Dropping missing values

Since the `user_id` column contained some corrupted data, those cells were set to NaN. 
The function `.isnull()` returns a DataFrame of the same shape with values:
 * False for cells where the data is not null
 * True for cells where the data is null

Since `True` counts as 1 and `False` as 0, calling `.sum()` on the result gives the number of missing values in each column.

In [None]:
#Let's see how many column values are missing
data.isnull().sum()

my_datetime    0
event          0
country        0
user_id        4
source         0
topic          0
dtype: int64

We can now remove the missing values with the `.dropna()` function.
If no argument is passed, this function will drop all rows where at least one of the values is missing.

In [None]:
# We first drop all rows that contain a missing value
data = data.dropna()

#Let's have another look
data.isnull().sum()


my_datetime    0
event          0
country        0
user_id        0
source         0
topic          0
dtype: int64
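
Dropping every incomplete row is not always what you want. As a sketch, `dropna` also accepts a `subset` argument that restricts the check to chosen columns; the `frame` below is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing user_id and one missing topic
frame = pd.DataFrame({
    'user_id': [2458151261.0, np.nan, 2458151263.0],
    'topic': ['Africa', 'Europe', None],
})

missing_per_column = frame.isnull().sum()       # True counts as 1 when summed

all_complete = frame.dropna()                   # drop rows with any missing value
id_present = frame.dropna(subset=['user_id'])   # only require user_id to be present

print(missing_per_column)
print(len(all_complete), len(id_present))
```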

---
## 2. Loading JSON files

Much of the data on the Internet comes in JSON format, a structured text format that is very similar to a Python dictionary.

We will see how to load a JSON dataset into a Pandas DataFrame.

We will use the Citibike API, which provides a real-time view of the Citibike stations in New York.
The API endpoint is http://www.citibikenyc.com/stations/json.

In [None]:
import requests
url = 'http://www.citibikenyc.com/stations/json'
data = requests.get(url).json()
data

Above you see how the JSON file looks. The JSON results contain two keys: The `executionTime` and `stationBeanList`. The `stationBeanList` is a list of dictionaries, with each dictionary corresponding to a Citibike station.

In [None]:
data.keys()

dict_keys(['executionTime', 'stationBeanList'])

With Pandas we can easily convert a list of dictionaries into a DataFrame.

In [None]:
import pandas as pd  # already imported in section 1; repeated so this cell runs on its own
df = pd.DataFrame(data["stationBeanList"])
df.head(5)

Unnamed: 0,id,stationName,availableDocks,totalDocks,latitude,longitude,statusValue,statusKey,availableBikes,stAddress1,stAddress2,city,postalCode,location,altitude,testStation,lastCommunicationTime,landMark
0,72,W 52 St & 11 Ave,32,39,40.767272,-73.993929,In Service,1,7,W 52 St & 11 Ave,,,,,,False,2016-01-22 04:30:15 PM,
1,79,Franklin St & W Broadway,0,33,40.719116,-74.006667,In Service,1,33,Franklin St & W Broadway,,,,,,False,2016-01-22 04:32:41 PM,
2,82,St James Pl & Pearl St,27,27,40.711174,-74.000165,In Service,1,0,St James Pl & Pearl St,,,,,,False,2016-01-22 04:29:41 PM,
3,83,Atlantic Ave & Fort Greene Pl,21,62,40.683826,-73.976323,In Service,1,40,Atlantic Ave & Fort Greene Pl,,,,,,False,2016-01-22 04:32:33 PM,
4,116,W 17 St & 8 Ave,19,39,40.741776,-74.001497,In Service,1,19,W 17 St & 8 Ave,,,,,,False,2016-01-22 04:32:32 PM,
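
The conversion can also be tried offline. The sketch below uses a hand-written stand-in for the API response: the station values are copied from the first two rows above with most fields left out, and the `executionTime` value is hypothetical.

```python
import pandas as pd

# Hand-written stand-in for the API response (not fetched from the network)
response = {
    'executionTime': '2016-01-22 04:32:49 PM',  # hypothetical value
    'stationBeanList': [
        {'id': 72, 'stationName': 'W 52 St & 11 Ave',
         'availableDocks': 32, 'totalDocks': 39, 'availableBikes': 7},
        {'id': 79, 'stationName': 'Franklin St & W Broadway',
         'availableDocks': 0, 'totalDocks': 33, 'availableBikes': 33},
    ],
}

# Each dictionary becomes one row; the dictionary keys become the columns
stations = pd.DataFrame(response['stationBeanList'])
print(stations)
```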


Let's try to understand the columns:

We notice that:

- **totalDocks** is roughly **availableBikes** (bikes ready to rent) + **availableDocks** (docks that are free). The match is not always exact: row 3 above has 40 + 21 = 61 against 62 total docks.

To see if the data has been imported correctly, we can verify the datatypes of the columns.

In [None]:
df.dtypes

id                         int64
stationName               object
availableDocks             int64
totalDocks                 int64
latitude                 float64
longitude                float64
statusValue               object
statusKey                  int64
availableBikes             int64
stAddress1                object
stAddress2                object
city                      object
postalCode                object
location                  object
altitude                  object
testStation                 bool
lastCommunicationTime     object
landMark                  object
dtype: object

One column that does not look correctly parsed is **lastCommunicationTime**, which is an `object` (i.e., a string), so you may want to convert it to the `datetime` type.

<div class="alert alert-block alert-success">
    <h2>Exercise 2:</h2>


    
>Convert the **lastCommunicationTime** into a `datetime` datatype. <br>
**Hint**: Use the [pandas.to_datetime](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html) function.
</div>

In [None]:
# Your solution here

In [None]:
# df["lastCommunicationTime"] = pd.to_datetime(df["lastCommunicationTime"])
# df.head()

Let's confirm that the **lastCommunicationTime** column is of type `datetime`.
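
One way to run this check is sketched below on two timestamps copied from the table above. The explicit `format` string is an assumption about the layout (year-month-day, 12-hour clock with an AM/PM marker).

```python
import pandas as pd

# Two timestamps copied from the lastCommunicationTime column above
stamps = pd.Series(['2016-01-22 04:30:15 PM', '2016-01-22 04:32:41 PM'])

# Assumed layout: year-month-day, 12-hour clock with AM/PM marker
parsed = pd.to_datetime(stamps, format='%Y-%m-%d %I:%M:%S %p')

is_datetime = pd.api.types.is_datetime64_any_dtype(parsed)
print(parsed.dtype, is_datetime)
print(parsed.dt.hour.tolist())  # hours on a 24-hour clock
```

On the full DataFrame, the same check is `df.dtypes` after the conversion from the exercise.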

### Adding a column

We can add a column `perc_full` that shows how full each station is.

In [None]:
df["perc_full"] = df['availableBikes']/df['totalDocks']
df.head()

Unnamed: 0,id,stationName,availableDocks,totalDocks,latitude,longitude,statusValue,statusKey,availableBikes,stAddress1,stAddress2,city,postalCode,location,altitude,testStation,lastCommunicationTime,landMark,perc_full
0,72,W 52 St & 11 Ave,32,39,40.767272,-73.993929,In Service,1,7,W 52 St & 11 Ave,,,,,,False,2016-01-22 04:30:15 PM,,0.179487
1,79,Franklin St & W Broadway,0,33,40.719116,-74.006667,In Service,1,33,Franklin St & W Broadway,,,,,,False,2016-01-22 04:32:41 PM,,1.0
2,82,St James Pl & Pearl St,27,27,40.711174,-74.000165,In Service,1,0,St James Pl & Pearl St,,,,,,False,2016-01-22 04:29:41 PM,,0.0
3,83,Atlantic Ave & Fort Greene Pl,21,62,40.683826,-73.976323,In Service,1,40,Atlantic Ave & Fort Greene Pl,,,,,,False,2016-01-22 04:32:33 PM,,0.645161
4,116,W 17 St & 8 Ave,19,39,40.741776,-74.001497,In Service,1,19,W 17 St & 8 Ave,,,,,,False,2016-01-22 04:32:32 PM,,0.487179


### `describe()`, `mean()`, `groupby()`, `where()`, and `sort_values()`
It is often interesting to compute simple statistics about a column, such as the mean or the highest values. The functions named above are very helpful for that.

In [None]:
#The describe() function gives you some descriptive statistics about your DF
df.describe()

Unnamed: 0,id,availableDocks,totalDocks,latitude,longitude,statusKey,availableBikes,perc_full
count,509.0,509.0,509.0,509.0,509.0,509.0,509.0,506.0
mean,1438.563851,21.502947,32.779961,40.728369,-73.983516,1.015717,10.534381,0.319718
std,1334.345344,12.749035,11.3979,0.026755,0.028253,0.176773,10.171236,0.277337
min,72.0,0.0,0.0,40.678907,-74.096937,1.0,0.0,0.0
25%,346.0,12.0,24.0,40.708771,-73.997044,1.0,2.0,0.08027
50%,488.0,20.0,31.0,40.725603,-73.982614,1.0,7.0,0.255656
75%,3102.0,28.0,39.0,40.749156,-73.962644,1.0,16.0,0.510993
max,3244.0,62.0,67.0,40.787209,-73.929891,3.0,46.0,1.0
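
Note that `perc_full` has a count of 506 while the other columns have 509. One plausible explanation, stated here as an assumption rather than verified against the data, is that a few stations report a `totalDocks` of 0 (the minimum in the table above), so the division gives 0/0. The sketch below shows that behaviour on hypothetical values.

```python
import pandas as pd

# Hypothetical stations; the last one has no docks at all
docks = pd.DataFrame({'availableBikes': [7, 33, 0],
                      'totalDocks': [39, 33, 0]})

docks['perc_full'] = docks['availableBikes'] / docks['totalDocks']

# 0 / 0 yields NaN rather than an error, and count() ignores NaN
non_missing = docks['perc_full'].count()
print(docks['perc_full'].tolist())
print(non_missing)
```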


First, let's suppose we want to find out how full the stations are on average (take the average of `perc_full`).

In [None]:
df['perc_full'].mean()

0.3197181112661129

There are two types of stations: the ones that are in service and the ones that are not. We can use the `.groupby()` method to find out how full the stations are on average, depending on their `statusValue`.

In [None]:
df.groupby('statusValue')['perc_full'].mean()

statusValue
In Service        0.320987
Not In Service    0.000000
Name: perc_full, dtype: float64
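
A group mean is easier to interpret next to the group size. As a sketch on hypothetical values, `agg` can compute several statistics per group in one call:

```python
import pandas as pd

# Hypothetical stations with the same column names as above
stations = pd.DataFrame({
    'statusValue': ['In Service', 'In Service', 'Not In Service'],
    'perc_full': [0.25, 0.75, 0.0],
})

# One row per status, one column per statistic
by_status = stations.groupby('statusValue')['perc_full'].agg(['mean', 'count'])
print(by_status)
```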

Let's say we are interested in knowing how full the stations with more than 50 docks are on average.

In [None]:
df.where(df.totalDocks > 50)['perc_full'].mean()

0.21945010834302064
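
`where()` keeps the shape of the DataFrame and replaces non-matching rows with NaN, which `mean()` then skips. Boolean indexing, as used earlier for the SEO filter, drops those rows instead. The sketch below, on hypothetical values, suggests that both routes give the same average.

```python
import pandas as pd

# Hypothetical stations; two of them have more than 50 docks
stations = pd.DataFrame({'totalDocks': [39, 62, 55],
                         'perc_full': [0.25, 0.5, 0.75]})
big = stations['totalDocks'] > 50

masked = stations.where(big)   # same shape, non-matching rows become NaN
filtered = stations[big]       # only matching rows are kept

mean_where = masked['perc_full'].mean()
mean_filter = filtered['perc_full'].mean()
print(masked.shape, filtered.shape, mean_where, mean_filter)
```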

Sometimes it is also useful to simply sort the values. When using the `sort_values()` function, you indicate the column to sort by with the argument `by='column'`. The `ascending` argument controls the direction; it defaults to `True` (lowest first), so state it explicitly whenever you want the highest values first.

- `sort_values(by='column', ascending=False)`: Will sort from highest to lowest.
- `sort_values(by='column', ascending=True)`: Will sort from lowest to highest.

For example, we can get the 3 stations that are the fullest.

In [None]:
df.sort_values(by='perc_full', ascending=False).head(3)

Unnamed: 0,id,stationName,availableDocks,totalDocks,latitude,longitude,statusValue,statusKey,availableBikes,stAddress1,stAddress2,city,postalCode,location,altitude,testStation,lastCommunicationTime,landMark,perc_full
405,3126,44 Dr & Jackson Ave,0,34,40.747182,-73.943264,In Service,1,34,44 Dr & Jackson Ave,,,,,,False,2016-01-22 04:29:33 PM,,1.0
124,343,Clinton Ave & Flushing Ave,0,23,40.69794,-73.969868,In Service,1,23,Clinton Ave & Flushing Ave,,,,,,False,2016-01-22 04:29:40 PM,,1.0
369,3090,N 8 St & Driggs Ave,0,31,40.717746,-73.956001,In Service,1,31,N 8 St & Driggs Ave,,,,,,False,2016-01-22 04:30:49 PM,,1.0


We can see that these stations are all totally full (`perc_full` = 1.0). Let's say we then want to sort them by the amount of available bikes. To do this, simply specify a list of columns.

In [None]:
df.sort_values(by=['perc_full', 'availableBikes'], ascending=False).head(3)

Unnamed: 0,id,stationName,availableDocks,totalDocks,latitude,longitude,statusValue,statusKey,availableBikes,stAddress1,stAddress2,city,postalCode,location,altitude,testStation,lastCommunicationTime,landMark,perc_full
405,3126,44 Dr & Jackson Ave,0,34,40.747182,-73.943264,In Service,1,34,44 Dr & Jackson Ave,,,,,,False,2016-01-22 04:29:33 PM,,1.0
1,79,Franklin St & W Broadway,0,33,40.719116,-74.006667,In Service,1,33,Franklin St & W Broadway,,,,,,False,2016-01-22 04:32:41 PM,,1.0
369,3090,N 8 St & Driggs Ave,0,31,40.717746,-73.956001,In Service,1,31,N 8 St & Driggs Ave,,,,,,False,2016-01-22 04:30:49 PM,,1.0


**Note:** If we wanted to sort by `perc_full` in descending order but by `availableBikes` in ascending order, we would use `ascending=[False, True]`.
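
A mixed sort order can be sketched on a few hypothetical stations (the single-letter station names are placeholders):

```python
import pandas as pd

# Hypothetical stations; three of them are completely full
stations = pd.DataFrame({
    'stationName': ['A', 'B', 'C', 'D'],
    'perc_full': [1.0, 1.0, 0.5, 1.0],
    'availableBikes': [34, 23, 10, 31],
})

# Fullest stations first; among equally full ones, fewest bikes first
ordered = stations.sort_values(by=['perc_full', 'availableBikes'],
                               ascending=[False, True])
print(ordered['stationName'].tolist())
```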

---
## Writing the data to a CSV

With the above, we just scratched the surface of what it means to do data processing.

Once you have done your basic data processing, you may want to save the DataFrame in a new CSV file, so that you don't have to repeat the same pre-processing every time. You can use the [to_csv](https://datatofish.com/export-dataframe-to-csv/) function.

**Note**: When you use Google Colab, this file will only be saved in your temporary virtual machine space and will be deleted once your Colab instance is closed (i.e. you close the window). If you want to explore more permanent solutions of saving your file look [here](https://colab.research.google.com/notebooks/io.ipynb).


In [None]:
# uncomment the following to save the file
# df.to_csv("my_new_file.csv", sep=",")
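
By default `to_csv` also writes the row index, which comes back as an extra unnamed column when the file is reloaded. The sketch below writes a hypothetical two-row frame to a temporary directory with `index=False` and reads it back; the temporary directory is only used so that the example leaves no file behind.

```python
import os
import tempfile
import pandas as pd

# Hypothetical frame standing in for the processed data
frame = pd.DataFrame({'id': [72, 79], 'perc_full': [0.25, 1.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'my_new_file.csv')
    # index=False leaves out the row index
    frame.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(list(reloaded.columns))
print(reloaded.equals(frame))
```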