# First steps with Pandas for data analysis

In this tutorial, we'll use the Python programming language and its Pandas library to analyze tabular data, i.e., any set of information stored as a table, where:

- **Columns** indicate **features**.
- **Rows** indicate **samples**.

Let's use Pandas to analyze a dataset about fuel prices in Brazil, available [here](https://www.kaggle.com/ficosta/combustible-price-brasil/download). This dataset is stored as a `.csv` file (comma-separated values), one of the most widely used tabular data formats. Next, we'll explore the data and see what information it can provide us.

Download the `.zip` file from the link above, unzip it to your computer, and upload the `.csv` file to your Google Drive. Next, run the cell below and follow the instructions to access your Drive files from this Colab.

In [0]:
from google.colab import drive
drive.mount('/content/drive')

## Importing pandas

First, we need to import some resources from the Pandas package into the Colab environment. To import a resource in Python, we must state which library it comes from. In this example, we will import `read_csv`, a **function** that reads tabular data stored as a `.csv` file.

1 - Uncomment the code below (delete the \# symbol) and press the play button on the left side of the cell (or use ***shift + enter***).

In [0]:
# from pandas import read_csv
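
The same function can also be reached through the package namespace. The sketch below compares both import styles; the alias `pd` is only a widespread convention and is an assumption here, since this tutorial imports `read_csv` directly.

```python
# Import only the function we need, as in the cell above.
from pandas import read_csv

# Alternatively, import the whole package under the conventional alias `pd`.
import pandas as pd

# Both names refer to the same function object.
same_function = read_csv is pd.read_csv
print(same_function)
```

Importing a single name keeps calls short (`read_csv(...)`), while the `pd.` prefix makes it explicit where each function comes from.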

## Reading a CSV file

Define a name for your dataset (e.g. *data*). The `read_csv()` function reads the CSV file, as follows:

```
data = read_csv('/path/to/your/file.csv')
```
In the example above, the name *data* is bound to a Pandas object. We'll talk about this later.

Note: To get the file path in Colab, navigate to your file using the folders shown in the left panel, right-click the file and choose ***Copy path***.
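
If you want to try `read_csv` before your file is in place, here is a minimal sketch that reads CSV text from memory instead of from disk. The column names and values are invented for illustration and are not taken from the fuel price dataset.

```python
from io import StringIO

from pandas import read_csv

# Hypothetical CSV content standing in for a file on disk.
csv_text = StringIO(
    "product,n_stations\n"
    "GASOLINA,120\n"
    "ETANOL,95\n"
)

# read_csv accepts a file path or a file-like object such as StringIO.
toy = read_csv(csv_text)
print(toy)
```

The first line of the text is used as the header, so `toy` ends up with two features and two samples.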

2 - Create a name for your dataset and read the CSV file using the `read_csv` function.

## Exploring the data

After we read the data, we need to find out which **features** it contains and how many **samples** it has. Pandas offers various ways to get information about a dataset. To find out the number of samples and features in a dataset, we can check the `shape` attribute.

In the previous example, you created a Pandas object called `data`. In Python, we access an attribute of an object as follows:

```
data.attribute_name
```
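
As an illustration of attribute access, the sketch below builds a small table by hand and reads its `shape`. The table contents are made up; only the `shape` attribute itself comes from the text above.

```python
from pandas import DataFrame

# A hypothetical table with 3 samples (rows) and 2 features (columns).
toy = DataFrame(
    {
        "product": ["GASOLINA", "ETANOL", "DIESEL"],
        "price": [2.86, 1.95, 2.33],
    }
)

# shape is an attribute, so there are no parentheses after its name.
# It holds a pair: (number of samples, number of features).
n_samples, n_features = toy.shape
print(n_samples, n_features)
```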

3.1 - Find the number of samples and features of the dataset using the `shape` attribute.

In [0]:
# data.shape

It's possible to view the first samples in a dataset using the method `head(n_lines)`. In Python, we access attributes and methods in the same way, but to call a method we append parentheses to its name. `n_lines` is a positive integer that, if passed as an **argument**, defines how many samples will be displayed.

**Observation**: If you choose not to provide `n_lines`, Pandas uses 5 as default.
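
A short sketch of both ways of calling `head()`, using an invented seven-row table rather than the fuel price dataset:

```python
from pandas import DataFrame

# A hypothetical table with 7 samples.
toy = DataFrame({"price": [2.5, 2.6, 2.7, 2.8, 2.9, 3.0, 3.1]})

# Without an argument, head() returns the first 5 samples.
first_default = toy.head()

# With an argument, head() returns that many samples.
first_two = toy.head(2)

print(len(first_default), len(first_two))
```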

3.2 - Using the method `head()`, display the first samples of the dataset. Does the number of features match what the `shape` attribute had indicated?

In [0]:
# data.head(2)

Note that, in the table above, the first sample has index `0` (zero). Therefore, the index of the last sample in the dataset equals the number of samples minus one. To display the last samples, use the method `tail()`.
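
The relationship between the last index and the number of samples can be checked on a small invented table. This sketch assumes the default index, which numbers the samples from zero:

```python
from pandas import DataFrame

# A hypothetical table with 4 samples and the default index (0, 1, 2, 3).
toy = DataFrame({"price": [2.5, 2.6, 2.7, 2.8]})

n_samples = toy.shape[0]
last_index = toy.index[-1]

# tail(2) returns the last two samples, keeping their original indices.
last_two = toy.tail(2)

print(n_samples, last_index)
print(last_two)
```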

3.3 - Use the method `tail()` to display the last samples in the dataset.

Since the dataset is provided in Brazilian Portuguese, we will translate the names of the features to make the data easier to understand.

We can do this through the `columns` attribute of the `DataFrame`:

In [0]:
# data.columns

The output from Pandas is a bit verbose, but the part that matters to us is the list of column names.

In Python, a list is represented by the notation `[element_1, element_2, ..., element_n]`:

```python3
['DATA INICIAL', 'DATA FINAL', 'PRODUTO', 'NÚMERO DE POSTOS PESQUISADOS',
       'UNIDADE DE MEDIDA', 'PREÇO MÉDIO REVENDA', 'DESVIO PADRÃO REVENDA',
       'PREÇO MÍNIMO REVENDA', 'PREÇO MÁXIMO REVENDA', 'MARGEM MÉDIA REVENDA',
       'COEF DE VARIAÇÃO REVENDA', 'PREÇO MÉDIO DISTRIBUIÇÃO',
       'DESVIO PADRÃO DISTRIBUIÇÃO', 'PREÇO MÍNIMO DISTRIBUIÇÃO',
       'PREÇO MÁXIMO DISTRIBUIÇÃO', 'COEF DE VARIAÇÃO DISTRIBUIÇÃO']
```

We can replace the feature names by assigning a new list to the `columns` attribute:

In [0]:
# The new names must follow the same order as the original columns.
data.columns = [
    "start_date",
    "end_date",
    "product",
    "n_stations",
    "metric",
    "avg_retail_price",
    "stddev_retail_price",
    "min_retail_price",
    "max_retail_price",
    "avg_retail_margin",
    "retail_variance",
    "avg_distribution_price",
    "stddev_distribution_price",
    "min_distribution_price",
    "max_distribution_price",
    "distribution_variance",
]
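
Assigning a full list requires one name per column, in order. As a hedged alternative that this tutorial does not otherwise use, the `rename()` method takes a dictionary and changes only the columns you mention. The sketch below applies it to a two-column table invented for illustration:

```python
from pandas import DataFrame

# A hypothetical table that reuses two of the original column names.
toy = DataFrame(
    {
        "PRODUTO": ["GASOLINA"],
        "DATA INICIAL": ["2013-01-06"],
    }
)

# rename() maps old names to new names and returns a new DataFrame;
# the original table is left unchanged.
renamed = toy.rename(
    columns={"PRODUTO": "product", "DATA INICIAL": "start_date"}
)

print(list(renamed.columns))
```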

## Pandas objects and data types

Pandas has two main data containers:
- **Series:** a one-dimensional sequence of values of the same type (numbers, names, ages, etc.).
- **DataFrame:** a collection of series, where each series has its own type.

The Pandas object we create with the `read_csv()` function is a `DataFrame`, since the `.csv` file contains several features. Each column in that file is interpreted by Pandas as a `Series`. We can check the type of an object using Python's built-in `type()` function.

In addition, `DataFrame` objects have an `info()` method, which summarizes, among other details, the data types used to represent the features and the number of non-null values in each feature.
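
To make the distinction between the two containers concrete, here is a minimal sketch on an invented two-column table. Selecting a single column with square brackets is used here only to obtain a `Series`.

```python
from pandas import DataFrame, Series

# A hypothetical table with two features of different types.
toy = DataFrame(
    {
        "product": ["GASOLINA", "ETANOL"],
        "price": [2.86, 1.95],
    }
)

# The whole table is a DataFrame; a single column is a Series.
print(type(toy))
print(type(toy["price"]))

# info() prints a summary: column names, non-null counts and data types.
toy.info()
```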

4.1 - Use the `type()` function to check the type of the object to which the name `data` refers:

In [0]:
# type(data)

4.2 - Use the `info()` method to check for a summary of your `DataFrame`:

When we analyze data, it's important to make sure that the correct data types are being used, since some operations can only be performed for certain data types. Pandas is able to infer some data types, but it's important to double-check the automatically selected types. For more info on Pandas data types, check this [link](https://pbpython.com/pandas_dtypes.html).

The main Pandas data types are the following:

| Type       | Description                       |
|------------|-----------------------------------|
| object     | Text or mixed values              |
| int64      | Integer numbers                   |
| float64    | Floating-point (decimal) numbers  |
| datetime64 | Date and time                     |
| bool       | True or False                     |

Note that features stored with the wrong data type cannot be analyzed properly. In that case, we need to convert them to the correct type. In Pandas, the `dtypes` attribute gives the data types of all features in a `DataFrame`:

4.2 - Use the `dtypes` attribute to check the data types of the dataframe:

In [0]:
data.dtypes
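
To see how Pandas infers types from the values it receives, consider the sketch below. The table is invented for illustration, and `to_datetime` is used only to build a date column; the exact `datetime64` resolution reported may vary between Pandas versions.

```python
import pandas as pd

# A hypothetical table with one feature per numeric, date and boolean type.
toy = pd.DataFrame(
    {
        "n_stations": [120, 95],
        "price": [2.86, 1.95],
        "start_date": pd.to_datetime(["2013-01-06", "2013-01-13"]),
        "is_fuel": [True, True],
    }
)

# dtypes lists the inferred type of each feature.
print(toy.dtypes)
```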

It's possible to convert data types in Pandas using the `astype()` method. For instance, we can change the type of a feature in a `DataFrame` named `data` as follows:

```python
data['feature'] = data['feature'].astype('new_type')
```

4.3 - Use the `astype()` method to change the type of the `avg_retail_price` feature to `float64`.

In [0]:
# data['avg_retail_price'] = data['avg_retail_price'].astype('float64')

Running the cell above should have produced the following error:

```python3
ValueError: could not convert string to float: '1,948'
```

Pandas raises this error because the original dataset uses commas rather than periods as decimal separators, so the values are read as text. Fixing this after loading would mean converting each affected feature of our `DataFrame` by hand.
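
To make the cost of a manual fix concrete, here is a sketch for a single feature. The values are invented, and the `str.replace` step is one possible workaround, not something the dataset's authors prescribe.

```python
from pandas import Series

# Hypothetical prices stored as text with comma decimal separators.
prices = Series(["1,948", "2,861"])

# A direct conversion fails, as in the error shown above.
try:
    prices.astype("float64")
    conversion_failed = False
except ValueError:
    conversion_failed = True

# Swapping the comma for a period first makes the conversion possible.
fixed = prices.str.replace(",", ".", regex=False).astype("float64")

print(conversion_failed)
print(fixed)
```

Repeating this for every numeric feature would be tedious and error-prone.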

Fortunately, the `read_csv()` function accepts a number of arguments, one of which, `decimal`, tells Pandas that the dataset being read uses commas instead of periods as the decimal separator:
```python
data = read_csv('/content/SEMANAL_BRASIL-DESDE_2013.csv', decimal=',')
```
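
The effect of `decimal` can be checked on a small in-memory example. The CSV text below is invented, and it assumes that values containing commas are wrapped in quotes so that they are not split into separate columns.

```python
from io import StringIO

from pandas import read_csv

# Hypothetical CSV text: prices use a comma as the decimal separator
# and are quoted so the comma is not mistaken for a column separator.
csv_text = (
    'product,avg_price\n'
    '"GASOLINA","2,861"\n'
    '"ETANOL","1,948"\n'
)

# Without decimal=',' the prices are read as text.
as_text = read_csv(StringIO(csv_text))

# With decimal=',' the prices are read as floating-point numbers.
as_numbers = read_csv(StringIO(csv_text), decimal=",")

print(as_text.dtypes)
print(as_numbers.dtypes)
```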

4.4 - Read the CSV file again and name the result `data2`:

- Tell the `read_csv()` function to use the comma as the decimal separator, as in the example above.
- Check the `dtypes` attribute to verify the data types inferred by Pandas.

In [0]:
# use the read_csv() function

In [0]:
# check the dtypes attribute