<a href="https://colab.research.google.com/github/millie-sky/Python-tutorials/blob/main/Tutorial_01_Setting_Up_Python_and_Understanding_Pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Tutorial 01: Setting Up Python and Understanding Pandas**



# Introduction
In this first tutorial, we will set you up with Python and introduce one of its most important libraries, pandas. By the end, you will have run some Python code and learned some basic pandas operations. Let's get started!

## What is Python?
Python is a high-level, interpreted programming language known for its simplicity and readability. It's a versatile language used for a wide range of applications, from web development to data science and artificial intelligence. Python's syntax is clear and intuitive, making it an excellent choice for beginners, yet it's powerful enough for use in complex, large-scale applications.

## Python Libraries: Extending Python's Capabilities
Python's true power lies in its vast ecosystem of libraries—collections of modules and functions that extend Python's capabilities. These libraries are designed for various purposes, including web development (Django, Flask), machine learning (scikit-learn, TensorFlow), and data analysis (numpy, pandas). By leveraging these libraries, we can perform complex tasks with relatively simple and concise Python code.


# Setting Up the Environment in Google Colab
In this section, we'll cover the initial steps required to start working with Python and pandas in Google Colab. Google Colab provides a Jupyter notebook environment that requires no setup on your part and offers free access to computing resources, making it an excellent platform for data analysis projects.

## What is Colab?

Colab, or ‘Colaboratory’, allows you to write and execute Python in your browser, with:
- Zero configuration required
- Access to GPUs free of charge
- Easy sharing

Watch <a href="https://www.youtube.com/watch?v=inN8seMm7UI">Introduction to Colab</a> to find out more, or just get started below!

## Getting started

The document that you are reading is not a static web page, but an interactive environment called a <strong>Colab notebook</strong> that lets you write and execute code.

For example, here is a <strong>code cell</strong> with a short Python script that computes a value, stores it in a variable and prints the result:

In [1]:
seconds_in_a_day = 24 * 60 * 60
seconds_in_a_day

86400

To execute the code in the above cell, select it with a click and then either press the play button to the left of the code, or use the keyboard shortcut 'Command/Ctrl+Enter'. To edit the code, just click the cell and start editing.

Variables that you define in one cell can later be used in other cells:

In [2]:
seconds_in_a_week = 7 * seconds_in_a_day
seconds_in_a_week

604800

Colab notebooks allow you to combine <strong>executable code</strong> and <strong>rich text</strong> in a single document, along with <strong>images</strong>, <strong>HTML</strong>, <strong>LaTeX</strong> and more. When you create your own Colab notebooks, they are stored in your Google Drive account. You can easily share your Colab notebooks with co-workers, allowing them to comment on your notebooks or even edit them. To find out more, see <a href="/notebooks/basic_features_overview.ipynb">Overview of Colab</a>. To create a new Colab notebook you can use the File menu above, or use the following link: <a href="http://colab.research.google.com#create=true">Create a new Colab notebook</a>.

Colab notebooks are Jupyter notebooks that are hosted by Colab. To find out more about the Jupyter project, see <a href="https://www.jupyter.org">jupyter.org</a>.

# Basic Python Syntax and Concepts
Before diving into pandas, let's briefly go over some basic Python syntax and concepts. This will help you understand the foundational elements of Python programming.

### Variables
Variables in Python are created by simply assigning a value to a name. Unlike SQL, there's no need to declare a variable before using it.

In [3]:
# Assigning a value to a variable
number_of_students = 30
course_name = "Data Science 101"

### Data Types
Python infers the data type of a variable from the value assigned to it, so you never declare a type explicitly. The basic types include integers, floats (decimal numbers), strings (text), and booleans (True/False).



In [4]:
student_count = 30  # Integer
course_rating = 4.5  # Float
course_name = "Data Science 101"  # String
course_active = True  # Boolean
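
To see which type Python has inferred, you can pass a value to the built-in `type()` function. The short sketch below repeats the variables from the cell above; the name `enrolled` is a hypothetical extra, used only to show that a name can later be rebound to a value of a different type.

```python
student_count = 30
course_rating = 4.5
course_name = "Data Science 101"
course_active = True

# type() reports the type of the value each name currently refers to
print(type(student_count))   # <class 'int'>
print(type(course_rating))   # <class 'float'>
print(type(course_name))     # <class 'str'>
print(type(course_active))   # <class 'bool'>

# Hypothetical example: the same name can be reassigned to a different type
enrolled = 30
enrolled = "thirty"
print(type(enrolled))        # <class 'str'>
```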

### Lists
Lists in Python are ordered, changeable collections of items. They play a role similar to arrays in other languages, but a single list can hold items of different types.

In [5]:
students = ["Alice", "Bob", "Charlie"] # list of strings
mixed_list = [5, 4, 4.5, "Bob" ] # list of different data types
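
A few common list operations are sketched below as an illustration. The list `roster` and the name "Dana" are hypothetical additions; the copy is made only so that the original `students` list stays unchanged.

```python
students = ["Alice", "Bob", "Charlie"]

print(students[0])    # indexing starts at 0 -> Alice
print(students[-1])   # negative indexes count from the end -> Charlie
print(len(students))  # number of items -> 3

# Work on a copy so the original list is left untouched
roster = students.copy()
roster.append("Dana")  # add an item to the end
print(roster)          # ['Alice', 'Bob', 'Charlie', 'Dana']
print(roster[1:3])     # slice: positions 1 and 2 -> ['Bob', 'Charlie']
```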

### Loops
Loops in Python, such as the for loop, allow you to iterate over a sequence of values.

In [6]:
for student in students:
    print(student)

Alice
Bob
Charlie


### Conditional Statements
Conditional statements in Python, using if, elif, and else, allow you to execute different blocks of code based on certain conditions.

In [7]:
if course_rating > 4:
    print("Highly rated course")
else:
    print("This course has room for improvement")

Highly rated course
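
The example above uses only `if` and `else`. As a small sketch of how `elif` adds a middle branch, the version below introduces an extra threshold; the cut-off of 4.8 and the label "Outstanding course" are made up for illustration.

```python
course_rating = 4.5

# Conditions are checked top to bottom; the first one that is true wins
if course_rating >= 4.8:
    label = "Outstanding course"
elif course_rating > 4:
    label = "Highly rated course"
else:
    label = "This course has room for improvement"

print(label)  # Highly rated course
```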


### Functions
Functions in Python are defined using the def keyword and are used to encapsulate reusable blocks of code.

In [8]:
def greet(student_name):
    return f"Welcome, {student_name}!"

print(greet("Alice"))

Welcome, Alice!
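
Functions can also take parameters with default values, which callers may override. The sketch below is a hypothetical variant of `greet` (named `greet_with` so it does not replace the original).

```python
def greet_with(student_name, greeting="Welcome"):
    """Return a greeting; `greeting` falls back to "Welcome" when omitted."""
    return f"{greeting}, {student_name}!"

print(greet_with("Alice"))                  # Welcome, Alice!
print(greet_with("Bob", greeting="Hello"))  # Hello, Bob!
```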


# Fundamentals of Pandas

## What is Pandas?

Pandas is an open-source library that provides high-performance, easy-to-use data structures and data analysis tools for Python. The name "pandas" is derived from "panel data," an econometrics term for multidimensional structured data sets.

Think of pandas as Python's answer to the power and flexibility of SQL for data manipulation, with added capabilities for handling a wider variety of data formats and for performing complex analyses and visualisations directly within Python. Its primary data structure, the DataFrame, resembles a SQL table in that it organises data in rows and columns, but it is considerably more flexible.

**To use this library, let's first import it: `import pandas as pd`**

In [9]:
# Importing the library
import pandas as pd

## Core Data Structures

### Series
A Series is a one-dimensional array-like object containing a sequence of values (similar to a list in Python) and an associated array of data labels, called its index. Think of it as a single column of a table.

**Creating a Series**

In [10]:
# Creating a Series from a list
data_list = [10, 20, 30, 40, 50]
series_from_list = pd.Series(data_list)

print("Series from list:")
print(series_from_list)

Series from list:
0    10
1    20
2    30
3    40
4    50
dtype: int64


In [11]:
# Creating a Series from a dictionary
data_dict = {'A': 100, 'B': 200, 'C': 300}
series_from_dict = pd.Series(data_dict)

print("\nSeries from dictionary:")
print(series_from_dict)


Series from dictionary:
A    100
B    200
C    300
dtype: int64


**Accessing Elements**

In [12]:
# Accessing elements by index label
print("Accessing element at index label 'B':", series_from_dict['B'])

Accessing element at index label 'B': 200


In [13]:
# Accessing elements by integer position
print("Accessing element at integer position 2:", series_from_list.iloc[2])

Accessing element at integer position 2: 30
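
The two access styles above can be made explicit with `.loc` (label-based) and `.iloc` (position-based). The sketch below uses a hypothetical Series named `scores` holding the same values as `series_from_dict`.

```python
import pandas as pd

scores = pd.Series({'A': 100, 'B': 200, 'C': 300})

# .loc looks up by index label, .iloc by integer position
print(scores.loc['B'])    # 200
print(scores.iloc[1])     # 200 (position 1 is the second element)

# Both accept several labels or a slice of positions
print(scores.loc[['A', 'C']])  # elements labelled 'A' and 'C'
print(scores.iloc[0:2])        # first two elements ('A' and 'B')
```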


**Attributes and Methods**

In [14]:
# Attribute: index
print("Index of the Series:", series_from_list.index)

Index of the Series: RangeIndex(start=0, stop=5, step=1)


In [15]:
# Method: head
print("First few elements of the Series:")
print(series_from_list.head())

First few elements of the Series:
0    10
1    20
2    30
3    40
4    50
dtype: int64


In [16]:
# Method: describe
print("Summary statistics of the Series:")
print(series_from_list.describe())

Summary statistics of the Series:
count     5.000000
mean     30.000000
std      15.811388
min      10.000000
25%      20.000000
50%      30.000000
75%      40.000000
max      50.000000
dtype: float64


**Operations**

In [17]:
# Arithmetic operations
result_series = series_from_list * 2
print("Result of multiplying the Series by 2:")
print(result_series)

Result of multiplying the Series by 2:
0     20
1     40
2     60
3     80
4    100
dtype: int64


In [18]:
# Element-wise operations
result_series = series_from_list + series_from_list
print("Result of adding the Series to itself element-wise:")
print(result_series)

Result of adding the Series to itself element-wise:
0     20
1     40
2     60
3     80
4    100
dtype: int64
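
Element-wise operations between two Series match values by index label, not by position. The sketch below uses two hypothetical Series, `left` and `right`, whose labels only partly overlap, to show what happens to labels found in just one of them.

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
right = pd.Series([10, 20], index=['b', 'c'])

# Labels are aligned first; 'a' has no partner in `right`, so the result is NaN
total = left + right
print(total)

# Series.add with fill_value treats a missing partner as 0 instead
filled = left.add(right, fill_value=0)
print(filled)
```

Here `total` holds NaN for 'a', 12.0 for 'b' and 23.0 for 'c', while `filled` keeps 1.0 for 'a'.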


**Handling Missing Data**

In [19]:
# Creating a Series with missing values
data_with_missing = [10, None, 30, None, 50]
series_with_missing = pd.Series(data_with_missing)

print("Series with missing values:")
print(series_with_missing)

Series with missing values:
0    10.0
1     NaN
2    30.0
3     NaN
4    50.0
dtype: float64


In [20]:
# Dropping missing values
series_without_missing = series_with_missing.dropna()

print("\nSeries without missing values:")
print(series_without_missing)


Series without missing values:
0    10.0
2    30.0
4    50.0
dtype: float64
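
Dropping is not the only option: missing values can also be counted with `isna()` or replaced with `fillna()`. The sketch below rebuilds the same Series; whether to fill with 0, with the mean, or not at all is a modelling choice, so treat these as illustrations only.

```python
import pandas as pd

series_with_missing = pd.Series([10, None, 30, None, 50])

# Count the missing values
missing_count = series_with_missing.isna().sum()
print("Number of missing values:", missing_count)  # 2

# Replace missing values with a constant
filled_with_zero = series_with_missing.fillna(0)
print(filled_with_zero)

# Replace missing values with the mean of the values that are present (30.0)
filled_with_mean = series_with_missing.fillna(series_with_missing.mean())
print(filled_with_mean)
```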


### DataFrame
A DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns). It is essentially a spreadsheet or SQL table in Python.

**Creating a DataFrame**

In [21]:
# Creating a DataFrame from a dictionary
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'City': ['New York', 'Los Angeles', 'Chicago']}
df = pd.DataFrame(data)

print("DataFrame:")
print(df)

DataFrame:
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
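
Once a DataFrame exists, a few attributes give a quick picture of its structure. The sketch below rebuilds the same small DataFrame so that it runs on its own.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 35],
                   'City': ['New York', 'Los Angeles', 'Chicago']})

print(df.shape)          # (rows, columns) -> (3, 3)
print(list(df.columns))  # ['Name', 'Age', 'City']
print(df.dtypes)         # data type of each column
```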


**Viewing DataFrame**

In [22]:
# Viewing the first few rows of the DataFrame
print("First few rows:")
print(df.head())

First few rows:
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago


In [23]:
# Viewing summary statistics of the DataFrame
print("Summary statistics:")
print(df.describe())

Summary statistics:
        Age
count   3.0
mean   30.0
std     5.0
min    25.0
25%    27.5
50%    30.0
75%    32.5
max    35.0


In [24]:
# print(df) shows the DataFrame as plain text; ending a cell with `df` on its own displays it as a formatted table
df

      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago


## Basic DataFrame Operations

### Indexing, Selection, and Filtering
DataFrames allow for indexing and selection of data in a way that's similar to SQL but more flexible.

**Selecting Columns**

In [25]:
# Selecting a single column
print("Selecting 'Name' column:")
print(df['Name'])

Selecting 'Name' column:
0      Alice
1        Bob
2    Charlie
Name: Name, dtype: object


In [26]:
# Selecting multiple columns
print("Selecting 'Name' and 'Age' columns:")
print(df[['Name', 'Age']])

Selecting 'Name' and 'Age' columns:
      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35


**Filtering Rows**

In [27]:
# Filtering rows based on a condition
print("Filtering rows where Age is greater than 25:")
print(df[df['Age'] > 25])

Filtering rows where Age is greater than 25:
      Name  Age         City
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
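
Several conditions can be combined in one filter, much like `AND`, `OR` and `IN` in a SQL `WHERE` clause. The sketch below rebuilds the example DataFrame; the variable names are hypothetical. Note that each condition needs its own parentheses, and that pandas uses `&` and `|` here instead of Python's `and` and `or`.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 35],
                   'City': ['New York', 'Los Angeles', 'Chicago']})

# AND: both conditions must hold
older_in_chicago = df[(df['Age'] > 25) & (df['City'] == 'Chicago')]
print(older_in_chicago)

# OR: at least one condition must hold
young_or_la = df[(df['Age'] < 30) | (df['City'] == 'Los Angeles')]
print(young_or_la)

# IN: the value must appear in a list
in_cities = df[df['City'].isin(['New York', 'Chicago'])]
print(in_cities)
```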


**Sorting DataFrame**

In [28]:
# Sorting DataFrame by a column
print("Sorting DataFrame by 'Age' column:")
print(df.sort_values(by='Age'))

Sorting DataFrame by 'Age' column:
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
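
Because the example data is already in ascending order of age, the sorted output above looks the same as the original. A descending sort makes the effect visible; the sketch below rebuilds the DataFrame and uses the hypothetical name `by_age_desc`.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 35],
                   'City': ['New York', 'Los Angeles', 'Chicago']})

# ascending=False sorts from largest to smallest
by_age_desc = df.sort_values(by='Age', ascending=False)
print(by_age_desc)

# sort_values returns a new DataFrame and keeps the original row labels;
# reset_index(drop=True) renumbers the rows from 0
print(by_age_desc.reset_index(drop=True))
```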


**Grouping and Aggregating**

In [29]:
# Grouping DataFrame by a column and aggregating
# numeric_only=True skips text columns such as 'Name', which cannot be averaged
grouped_df = df.groupby('City').mean(numeric_only=True)
print("Grouping by 'City' column and calculating mean:")
print(grouped_df)

Grouping by 'City' column and calculating mean:
              Age
City             
Chicago      35.0
Los Angeles  30.0
New York     25.0
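
In the example above every city appears only once, so each group mean is simply that person's age. Grouping becomes more informative when groups contain several rows. The sketch below uses made-up data in a hypothetical DataFrame `visits` and computes two aggregates at once with `agg`.

```python
import pandas as pd

# Hypothetical data: two rows per city
visits = pd.DataFrame({'City': ['New York', 'Chicago', 'New York', 'Chicago'],
                       'Age': [25, 35, 45, 31]})

# Select the column to aggregate, then apply several functions to each group
summary = visits.groupby('City')['Age'].agg(['mean', 'count'])
print(summary)
```

Here New York has a mean age of 35.0 and Chicago 33.0, each over 2 rows, similar to `SELECT City, AVG(Age), COUNT(*) ... GROUP BY City` in SQL.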




**Joining DataFrames**

In [30]:
# Creating another DataFrame (population in millions, approximate)
data2 = {'City': ['New York', 'Los Angeles', 'Chicago'],
         'Population': [8.4, 3.9, 2.7]}
df2 = pd.DataFrame(data2)

# Inner join with another DataFrame
merged_df = pd.merge(df, df2, on='City')
print("Inner join with another DataFrame:")
print(merged_df)

Inner join with another DataFrame:
      Name  Age         City  Population
0    Alice   25     New York         8.4
1      Bob   30  Los Angeles         3.9
2  Charlie   35      Chicago         2.7
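
`pd.merge` performs an inner join by default, which keeps only rows whose key appears in both DataFrames. The `how` parameter selects other join types. The sketch below uses hypothetical DataFrames `people` and `cities`, where "Boston" has no match, to contrast an inner join with a left join.

```python
import pandas as pd

people = pd.DataFrame({'Name': ['Alice', 'Bob', 'Dana'],
                       'City': ['New York', 'Los Angeles', 'Boston']})
cities = pd.DataFrame({'City': ['New York', 'Los Angeles', 'Chicago'],
                       'Population': [8.4, 3.9, 2.7]})

# Inner join: only cities present in both DataFrames (Dana is dropped)
inner = pd.merge(people, cities, on='City', how='inner')
print(inner)

# Left join: every row of `people` is kept; unmatched rows get NaN
left = pd.merge(people, cities, on='City', how='left')
print(left)
```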


**Row and Column Selection**

In [31]:
# Select rows 0 to 2 (inclusive) and all columns
print(df.loc[0:2, :])

      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago


In [32]:
# Select rows by index and columns by name
print(df.loc[0:2, ['Name', 'Age']])

      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35
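
`.loc` selects by label and includes the end of a slice, whereas `.iloc` selects by integer position and excludes it, like ordinary Python slicing. The sketch below rebuilds the example DataFrame to show the difference; `adults` is a hypothetical name.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 35],
                   'City': ['New York', 'Los Angeles', 'Chicago']})

# Label-based: rows labelled 0 and 1 (end label included)
print(df.loc[0:1, ['Name', 'Age']])

# Position-based: only row 0 (end position excluded), first two columns
print(df.iloc[0:1, 0:2])

# After filtering, labels and positions no longer coincide
adults = df[df['Age'] > 25]      # keeps the rows labelled 1 and 2
print(adults.loc[1, 'Name'])     # label 1 -> Bob
print(adults.iloc[0, 0])         # first row, first column -> Bob
```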


## Reading and Writing Data
Pandas supports a variety of file formats, making it easy to read data from and write data to different sources.

**Reading Data**

To read data in Python using Google Colab, you can follow these steps:

1. Upload Data: First, you need to upload your data files to Google Colab. You can do this by clicking on the "Files" tab on the left sidebar, then clicking on the "Upload" button and selecting the files you want to upload from your local machine.

2. Mount Google Drive (Optional): If your data is stored in Google Drive, you can mount your Google Drive in Colab using the following code:

In [None]:
from google.colab import drive
drive.mount('/content/drive')

This will prompt you to authenticate and give Colab access to your Google Drive. Once mounted, your Drive files are available under `/content/drive/MyDrive/`.

Now that your data is uploaded or accessible, you can use pandas to read it into your Colab notebook. For example, a CSV file named `data.csv` can be read into a pandas DataFrame with `pd.read_csv`. If the file is in Colab's current working directory (where files uploaded through the "Files" tab are placed by default), the file name alone is enough; otherwise, provide the full path to the file. The code below reads the CSV file into the DataFrame `df`.

In [None]:
#### Notice: the code below is illustrative. If you run it without the files available, you will get errors.

# Reading a CSV file
# Assuming 'data.csv' is located in '/content/drive/MyDrive/datasets/'
file_path = '/content/drive/MyDrive/datasets/data.csv'
df = pd.read_csv(file_path)

# If 'data.csv' is in Colab's current working directory, you can read it by file name alone
df = pd.read_csv('data.csv')

# Reading an Excel file
df = pd.read_excel('data.xlsx')

**Writing Data**

In [None]:
# Assuming you have a DataFrame called 'df' that you want to write to CSV
df.to_csv('output.csv', index=False)

This writes the DataFrame to a CSV file named `output.csv` in the current working directory of your Colab environment. Setting `index=False` ensures that the DataFrame index is not included in the CSV file.

In [None]:
# Assuming you have a DataFrame called 'df' that you want to write to Excel
df.to_excel('output.xlsx', index=False)

In [None]:
# Or you can define an output path
output_file_path = '/content/drive/MyDrive/output/output.csv'

# Assuming 'df' is your DataFrame and 'output_file_path' is the desired path
df.to_csv(output_file_path, index=False)
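
The reading and writing examples above depend on files that may not exist in your environment. As a self-contained sketch, the code below writes the small example DataFrame to a temporary folder and reads it straight back. The temporary folder and the name `round_trip` are assumptions made for illustration; in practice you would use your own file paths.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 35],
                   'City': ['New York', 'Los Angeles', 'Chicago']})

# Write to a temporary folder that is deleted when the block ends
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'output.csv')
    df.to_csv(csv_path, index=False)   # index=False leaves out the row labels
    round_trip = pd.read_csv(csv_path)

print(round_trip)
```

Without `index=False`, the row labels would be written as an extra unnamed column, which would then come back as a column called `Unnamed: 0` when the file is read.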

<div class="markdown-google-sans">

# More resources

### Working with notebooks in Colab

</div>

- [Overview of Colaboratory](/notebooks/basic_features_overview.ipynb)
- [Guide to markdown](/notebooks/markdown_guide.ipynb)
- [Importing libraries and installing dependencies](/notebooks/snippets/importing_libraries.ipynb)
- [Saving and loading notebooks in GitHub](https://colab.research.google.com/github/googlecolab/colabtools/blob/main/notebooks/colab-github-demo.ipynb)
- [Interactive forms](/notebooks/forms.ipynb)
- [Interactive widgets](/notebooks/widgets.ipynb)

<div class="markdown-google-sans">

<a name="working-with-data"></a>
### Working with data
</div>

- [Loading data: Drive, Sheets and Google Cloud Storage](/notebooks/io.ipynb)
- [Charts: visualising data](/notebooks/charts.ipynb)
- [Intro to Pandas DataFrame](https://colab.research.google.com/github/google/eng-edu/blob/main/ml/cc/exercises/pandas_dataframe_ultraquick_tutorial.ipynb)
