
# **Introduction to Google Colab and Pandas**

In this notebook, you will become familiar with manipulating tabular data using the Python programming language, Google Colab and Pandas. But what are these?

* ***Python*** is a high-level, general-purpose programming language that is easy to learn and use. With Python you can collect, clean, integrate, analyze and visualize data, perform AI tasks such as training machine learning models, and much more.

* The document you are reading is a ***Notebook***. In this dynamic web-based document you can write not only text, but also Python code that can be executed.

* ***Colab***, or "Colaboratory", is Google's service for working with notebooks. You can think of Colab as notebooks stored in Google Drive. With Colab you can easily write, run and share notebooks.

* ***Pandas*** is Python's Swiss Army knife for processing tabular data. Most of what you can do in Microsoft Excel can be done with Pandas, but instead of interacting with a graphical user interface, you write lines of code.

# First steps with Colab

The document you are reading is not a static web page, but an interactive environment, a **Colab notebook**, that lets you write and execute code.

A notebook is composed of cells, which can contain code, text, images, and more. For example, this is a **text cell** and here is a **code cell** with a short Python script that computes a value and prints the result:
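As a minimal sketch, such a code cell could look like the following. The variable name and the computed quantity are made up for illustration:

```python
# Compute a value (the number of seconds in a day) and print the result
seconds_in_a_day = 24 * 60 * 60
print(seconds_in_a_day)
```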

Text cells use [**markdown**](https://www.markdownguide.org/cheat-sheet/), which lets you format text by creating headers, emphasizing with bold and italic, inserting URL links, images, and more.

Code cells are independently executed and can have multiple lines of code:

In [None]:
# This is a comment within a code cell and is ignored when running the code

# Reading a CSV with Pandas

We will be working with the “California housing dataset” derived from the 1990 U.S. census. The columns of this dataset are:

* *MedInc*: median income in block group
* *HouseAge*: median house age in block group
* *AveRooms*: average number of rooms per household
* *AveBedrms*: average number of bedrooms per household
* *Population*: block group population
* *AveOccup*: average number of household members
* *Latitude*: block group latitude
* *Longitude*: block group longitude

The data is available in CSV format in the Colab session storage (see `sample_data` in the *Files* of the session). Let's **read** this **CSV** with Pandas. Think of it like opening the file with Microsoft Excel.

Something went wrong: an `error` occurred. The reason is that before using Pandas, we need to import it. This is like adding a plugin or add-on to Chrome or Firefox, which extends the basic functionality with additional features. In the case of Pandas, we are providing Python with capabilities to manipulate tabular data.

Importing pandas is as simple as:
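A minimal sketch of the import follows. The `pd` alias is a widespread convention rather than a requirement, and printing the version is only a quick way to confirm that the import worked:

```python
import pandas as pd

# Confirm the import worked by printing the installed version
print(pd.__version__)
```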

That is it: Pandas is now available and we can **rerun** the failed code cell. Importing only needs to be done once per session; **restarting a session** requires importing pandas again.

Notice that code cells do **NOT** have to be **run in order** of appearance: we can execute the `import pandas` cell before the `read_csv` cell even though they appear in the opposite order.

Let's visualize the data:
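A minimal sketch of one way to do this, assuming the data is held in a DataFrame named `df`. Here `df` is a tiny stand-in with made-up values so that the example runs on its own:

```python
import pandas as pd

# Tiny stand-in for the housing data; the values are made up
df = pd.DataFrame({
    "MedInc": [8.3, 7.2, 5.6, 3.8],
    "HouseAge": [41, 21, 52, 36],
    "Population": [322, 2401, 496, 565],
})

# head(n) returns the first n rows
first_rows = df.head(3)

# In a notebook, the last expression of a cell is displayed as a table
first_rows
```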

The table that you are seeing is called a **DataFrame** and we stored it in the `df` variable.

[`read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) is a function of Pandas (that is, something Pandas can do). It has many useful options, such as specifying the separator used in the file (`,`, `;`, `|`, ...).
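The separator option can be sketched as follows. To keep the example self-contained, it reads a small semicolon-separated text held in memory (made-up values) instead of a file on disk:

```python
import io
import pandas as pd

# Hypothetical semicolon-separated content standing in for a real file
csv_text = "MedInc;HouseAge;Population\n8.3;41;322\n7.2;21;2401\n"

# sep tells read_csv which character separates the columns
df = pd.read_csv(io.StringIO(csv_text), sep=";")
print(df)
```

In Colab you would pass a file path instead of the in-memory text, for example the path of one of the CSV files under `sample_data`.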

# Working with a DataFrame

In [None]:
# Select a column
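A minimal sketch of selecting one column, using a small stand-in DataFrame with made-up values:

```python
import pandas as pd

# Tiny stand-in for the housing data; the values are made up
df = pd.DataFrame({
    "MedInc": [8.3, 7.2, 5.6, 3.8],
    "HouseAge": [41, 21, 52, 36],
    "Population": [322, 2401, 496, 565],
})

# Square brackets with a column name return that column as a Series
house_age = df["HouseAge"]
print(house_age)
```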

In [None]:
# Select more than one column
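Selecting several columns works the same way, except that a list of names is passed. A sketch on made-up values:

```python
import pandas as pd

# Tiny stand-in for the housing data; the values are made up
df = pd.DataFrame({
    "MedInc": [8.3, 7.2, 5.6, 3.8],
    "HouseAge": [41, 21, 52, 36],
    "Population": [322, 2401, 496, 565],
})

# A list of column names (note the double brackets) returns a DataFrame
subset = df[["MedInc", "Population"]]
print(subset)
```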

In [None]:
# Select rows
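Rows can be selected by position or by a condition. The sketch below shows both on made-up values; the threshold of 40 years is an arbitrary choice for illustration:

```python
import pandas as pd

# Tiny stand-in for the housing data; the values are made up
df = pd.DataFrame({
    "MedInc": [8.3, 7.2, 5.6, 3.8],
    "HouseAge": [41, 21, 52, 36],
    "Population": [322, 2401, 496, 565],
})

# By position: the first two rows
first_two = df.iloc[0:2]

# By condition: rows where the median house age is above 40
older = df[df["HouseAge"] > 40]
print(older)
```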

In [None]:
# Rename columns
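A sketch of renaming with `rename`, which takes a mapping from old to new names. The new name `MedianIncome` is made up for illustration:

```python
import pandas as pd

# Tiny stand-in for the housing data; the values are made up
df = pd.DataFrame({
    "MedInc": [8.3, 7.2],
    "HouseAge": [41, 21],
})

# rename returns a new DataFrame; the original df is left unchanged
renamed = df.rename(columns={"MedInc": "MedianIncome"})
print(renamed.columns.tolist())
```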

In [None]:
# Remove columns
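Columns can be removed with `drop`. A minimal sketch on made-up values:

```python
import pandas as pd

# Tiny stand-in for the housing data; the values are made up
df = pd.DataFrame({
    "MedInc": [8.3, 7.2],
    "HouseAge": [41, 21],
    "Population": [322, 2401],
})

# drop returns a new DataFrame without the listed columns
without_population = df.drop(columns=["Population"])
print(without_population)
```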

In [None]:
# Remove row
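The same `drop` method removes rows when it is given index labels. A sketch on made-up values:

```python
import pandas as pd

# Tiny stand-in for the housing data; the values are made up
df = pd.DataFrame({
    "MedInc": [8.3, 7.2, 5.6, 3.8],
    "HouseAge": [41, 21, 52, 36],
})

# Remove the row whose index label is 0
without_first = df.drop(index=0)
print(without_first)
```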

In [None]:
# Sort rows by the values of a column
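Sorting can be sketched with `sort_values`. The choice of column and of descending order is arbitrary here:

```python
import pandas as pd

# Tiny stand-in for the housing data; the values are made up
df = pd.DataFrame({
    "MedInc": [8.3, 7.2, 5.6, 3.8],
    "HouseAge": [41, 21, 52, 36],
})

# Sort by HouseAge from oldest to newest; ascending=True is the default
by_age = df.sort_values("HouseAge", ascending=False)
print(by_age)
```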

In [None]:
# Median house values are in dollars $, get them in thousands of dollars (create new column)
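A sketch of deriving a new column from an existing one. The column names `MedHouseVal` and `MedHouseVal_k` are assumptions, since the column list above does not name the house value column, and the values are made up:

```python
import pandas as pd

# MedHouseVal is an assumed column name; values are made-up dollar amounts
df = pd.DataFrame({
    "MedInc": [8.3, 2.1, 5.6, 1.8],
    "MedHouseVal": [450000, 120000, 300000, 85000],
})

# Assigning to a new column name creates the column
df["MedHouseVal_k"] = df["MedHouseVal"] / 1000
print(df)
```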

In [None]:
# Separate median house values in low and high
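One possible way to label values as low or high is `numpy.where`, which picks between two labels based on a condition. The column names and the 200,000 dollar threshold below are assumptions chosen for illustration:

```python
import numpy as np
import pandas as pd

# MedHouseVal is an assumed column name; values are made-up dollar amounts
df = pd.DataFrame({
    "MedHouseVal": [450000, 120000, 300000, 85000],
})

# Hypothetical threshold separating low from high values
threshold = 200_000

# "high" where the condition holds, "low" elsewhere
df["ValueLevel"] = np.where(df["MedHouseVal"] > threshold, "high", "low")
print(df)
```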

In [None]:
# Remove rows with high median house values
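Removing rows by condition amounts to keeping the rows that do not meet it. A sketch with an assumed column name and an assumed 200,000 dollar threshold:

```python
import pandas as pd

# MedHouseVal is an assumed column name; values are made-up dollar amounts
df = pd.DataFrame({
    "MedInc": [8.3, 2.1, 5.6, 1.8],
    "MedHouseVal": [450000, 120000, 300000, 85000],
})

# Keep only the rows at or below the hypothetical threshold
low_only = df[df["MedHouseVal"] <= 200_000]
print(low_only)
```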

In [None]:
# Create a new row
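One way to add a row is to build a one-row DataFrame and append it with `pd.concat`. The values of the new row are made up:

```python
import pandas as pd

# Tiny stand-in for the housing data; the values are made up
df = pd.DataFrame({
    "MedInc": [8.3, 7.2],
    "HouseAge": [41, 21],
    "Population": [322, 2401],
})

# The new row as a one-row DataFrame with the same columns
new_row = pd.DataFrame([{"MedInc": 4.1, "HouseAge": 30, "Population": 800}])

# ignore_index=True renumbers the rows 0, 1, 2, ...
df = pd.concat([df, new_row], ignore_index=True)
print(df)
```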

We can read multiple formats into Pandas, such as CSV, Excel, SPSS, SAS, Parquet, Stata, and more. After we are done working with the data, we can also export it to the format of our choice. For instance, save our data to an Excel file:
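A sketch of exporting follows. Writing Excel files relies on an extra package such as `openpyxl`, so the example guards that step; the file names and the temporary folder are hypothetical choices:

```python
import os
import tempfile
import pandas as pd

# Tiny stand-in for the housing data; the values are made up
df = pd.DataFrame({"MedInc": [8.3, 7.2], "HouseAge": [41, 21]})

# Hypothetical output locations inside a temporary folder
out_dir = tempfile.mkdtemp()
xlsx_path = os.path.join(out_dir, "housing.xlsx")
csv_path = os.path.join(out_dir, "housing.csv")

try:
    # index=False leaves the row labels out of the spreadsheet
    df.to_excel(xlsx_path, index=False)
    exported = True
except ImportError:
    # Raised when no Excel writer package (such as openpyxl) is installed
    exported = False

# CSV export needs no extra package
df.to_csv(csv_path, index=False)
```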

# Visualize data

In [None]:
# Create a histogram
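A sketch of a histogram using the plotting methods built into Pandas. It assumes Matplotlib is installed, since Pandas uses it for drawing; the values and the number of bins are made up:

```python
import pandas as pd

# Tiny stand-in for the housing data; the values are made up
df = pd.DataFrame({"HouseAge": [41, 21, 52, 36, 15, 44, 29, 52]})

# plot.hist draws the histogram (requires Matplotlib) and returns the axes
ax = df["HouseAge"].plot.hist(bins=5)
ax.set_xlabel("HouseAge")
```

In a notebook, the figure is displayed below the cell once it has run.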

In [None]:
# Create a scatter plot
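A scatter plot can be sketched in the same way, again assuming Matplotlib is installed. The pair of columns is an arbitrary choice and the values are made up:

```python
import pandas as pd

# Tiny stand-in for the housing data; the values are made up
df = pd.DataFrame({
    "MedInc": [8.3, 7.2, 5.6, 3.8],
    "HouseAge": [41, 21, 52, 36],
})

# One point per row: MedInc on the x axis, HouseAge on the y axis
ax = df.plot.scatter(x="MedInc", y="HouseAge")
```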

# Explore the documentation of Pandas

Pandas offers a lot of functionality, and you do not need to learn all of it. Being familiar with the **[documentation](https://pandas.pydata.org/docs/reference/index.html)** lets you look up how to do new things.

In the documentation, functions come with examples that you can find below their description. You can start by exploring **[sort_values](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html)** and **[replace](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html)**.
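As a small taste of `replace`, the sketch below recodes the values of one column. The column name `ValueLevel` and all labels are made up for illustration:

```python
import pandas as pd

# ValueLevel is a hypothetical column; the values are made up
df = pd.DataFrame({
    "ValueLevel": ["high", "low", "high"],
    "HouseAge": [41, 21, 52],
})

# Nested mapping: {column name: {old value: new value}}
recoded = df.replace({"ValueLevel": {"low": "affordable", "high": "expensive"}})
print(recoded)
```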

# Play with your own data!

In [None]:
# You can start by reading an Excel file from Google Drive