# **Tutorial 1: Bridging SQL to Python with Pandas**

Welcome! This tutorial is designed to help SQL users make a smooth transition into Python. Whether you're analysing data, automating tasks, or building applications, using Python alongside SQL can significantly broaden your toolkit.


# Introduction

## What is Python?
Python is a high-level, interpreted programming language known for its simplicity and readability. It's a versatile language used for a wide range of applications, from web development to data science and artificial intelligence. Python's syntax is clear and intuitive, making it an excellent choice for beginners, yet it's powerful enough for use in complex, large-scale applications.

## Python vs. SQL: A Comparison
While SQL (Structured Query Language) is a domain-specific language used primarily for managing and manipulating relational databases, Python is a general-purpose programming language. SQL excels in querying and manipulating data stored in a structured database format. In contrast, Python offers a broader scope of applications, including but not limited to data manipulation.

One of Python's strengths is its ability to handle data that is not only structured (like SQL databases) but also unstructured or semi-structured data (like text files, JSON, and XML). This makes Python an indispensable tool in the data scientist's toolbox, especially when dealing with diverse data sources or when data needs to be preprocessed, analyzed, and visualized in complex ways that go beyond SQL's capabilities.

## Python Libraries: Extending Python's Capabilities
Python's true power lies in its vast ecosystem of libraries—collections of modules and functions that extend Python's capabilities. These libraries are designed for various purposes, including web development (Django, Flask), machine learning (scikit-learn, TensorFlow), and data analysis (numpy, pandas). By leveraging these libraries, we can perform complex tasks with relatively simple and concise Python code.

## What is Pandas?
Think of pandas as Python's answer to the power and flexibility of SQL for data manipulation, with added capabilities for handling a wider variety of data formats and for performing complex analyses and visualisations directly within Python. Its primary data structure, the DataFrame, organizes data in rows and columns much like a SQL table, but with more flexibility.

By the end of this tutorial, you'll understand how to use pandas to perform tasks that you might be familiar with in SQL, such as selecting, filtering, aggregating, and joining data, but with the added power and versatility that Python offers. You'll see how pandas not only complements your SQL skills but also opens up a new world of data analysis possibilities.

Let's get started!

# Setting Up the Environment in Google Colab
In this section, we'll cover the initial steps required to start working with Python and pandas in Google Colab. Google Colab provides a Jupyter notebook environment that requires no setup on your part and offers free access to computing resources, making it an excellent platform for data analysis projects.

## What is Colab?

Colab, or ‘Colaboratory’, allows you to write and execute Python in your browser, with
- Zero configuration required
- Access to GPUs free of charge
- Easy sharing

Watch <a href="https://www.youtube.com/watch?v=inN8seMm7UI">Introduction to Colab</a> to find out more, or just get started below!

## Getting started

The document that you are reading is not a static web page, but an interactive environment called a <strong>Colab notebook</strong> that lets you write and execute code.

For example, here is a <strong>code cell</strong> with a short Python script that computes a value, stores it in a variable and prints the result:

In [1]:
seconds_in_a_day = 24 * 60 * 60
seconds_in_a_day

86400

To execute the code in the above cell, select it with a click and then either press the play button to the left of the code, or use the keyboard shortcut 'Command/Ctrl+Enter'. To edit the code, just click the cell and start editing.

Variables that you define in one cell can later be used in other cells:

In [2]:
seconds_in_a_week = 7 * seconds_in_a_day
seconds_in_a_week

604800

Colab notebooks allow you to combine <strong>executable code</strong> and <strong>rich text</strong> in a single document, along with <strong>images</strong>, <strong>HTML</strong>, <strong>LaTeX</strong> and more. When you create your own Colab notebooks, they are stored in your Google Drive account. You can easily share your Colab notebooks with co-workers, allowing them to comment on your notebooks or even edit them. To find out more, see <a href="/notebooks/basic_features_overview.ipynb">Overview of Colab</a>. To create a new Colab notebook you can use the File menu above, or use the following link: <a href="http://colab.research.google.com#create=true">Create a new Colab notebook</a>.

Colab notebooks are Jupyter notebooks that are hosted by Colab. To find out more about the Jupyter project, see <a href="https://www.jupyter.org">jupyter.org</a>.

# Basic Python Syntax and Concepts for SQL Users
Before diving into pandas, let's briefly go over some basic Python syntax and concepts. This will help SQL users understand the foundational elements of Python programming.

### Variables
Variables in Python are created by simply assigning a value to a name. Unlike SQL, there's no need to declare a variable before using it.

In [3]:
# Assigning a value to a variable
number_of_students = 30
course_name = "Data Science 101"

### Data Types
Python infers the data type of a variable from the value assigned to it, so you never declare a type up front. The basic types include integers, floats (decimal numbers), strings (text), and booleans (True/False).



In [4]:
student_count = 30  # Integer
course_rating = 4.5  # Float
course_name = "Data Science 101"  # String
course_active = True  # Boolean
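
To see which type Python inferred, you can ask with the built-in `type()` function. The short sketch below reuses the variables from the cell above and also shows explicit conversion, which plays a role similar to `CAST` in SQL. The names `rating_as_text` and `count_as_float` are made up for illustration.

```python
student_count = 30
course_rating = 4.5
course_name = "Data Science 101"
course_active = True

# type() reports the type Python inferred from each value
print(type(student_count))
print(type(course_rating))
print(type(course_name))
print(type(course_active))

# Explicit conversion, comparable to CAST in SQL
rating_as_text = str(course_rating)     # float -> string
count_as_float = float(student_count)   # integer -> float
print(rating_as_text, count_as_float)
```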

### Lists
Lists in Python are ordered, changeable collections, similar to arrays in other languages, and can hold items of any type.

In [5]:
students = ["Alice", "Bob", "Charlie"]
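
A few common list operations are sketched below. The example uses a separate list called `class_list` (a hypothetical name, as is the extra student "Dana") so that the `students` list above stays unchanged for the later cells.

```python
class_list = ["Alice", "Bob", "Charlie"]

first_student = class_list[0]    # indexing starts at 0
last_student = class_list[-1]    # negative indexes count from the end
class_list.append("Dana")        # add an item to the end of the list
first_two = class_list[:2]       # slice: positions 0 and 1 (end position excluded)

print(len(class_list), first_student, last_student, first_two)
```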

### Loops
Loops in Python, such as the for loop, allow you to iterate over a sequence of values.

In [6]:
for student in students:
    print(student)

Alice
Bob
Charlie


### Conditional Statements
Conditional statements in Python, using if, elif, and else, allow you to execute different blocks of code based on certain conditions.

In [7]:
if course_rating > 4:
    print("Highly rated course")
else:
    print("This course has room for improvement")

Highly rated course
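
When there are more than two outcomes, `elif` adds further branches, which reads much like a `CASE WHEN ... WHEN ... ELSE ... END` expression in SQL. This is a minimal sketch; the thresholds and labels are invented for illustration.

```python
course_rating = 4.5

# Branches are checked top to bottom; the first one that is true wins
if course_rating >= 4.5:
    label = "Excellent"
elif course_rating >= 3:
    label = "Good"
else:
    label = "Needs improvement"

print(label)
```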


### Functions
Functions in Python are defined using the def keyword and are used to encapsulate reusable blocks of code.

In [8]:
def greet(student_name):
    return f"Welcome, {student_name}!"

print(greet("Alice"))

Welcome, Alice!


# Fundamentals of Pandas
Pandas is an open-source library that provides high-performance, easy-to-use data structures and data analysis tools for Python. The name "pandas" is derived from "panel data," an econometrics term for multidimensional structured data sets.

**To use this library, let's first import it using `import pandas as pd`**

At the heart of pandas are the Series and DataFrame data structures, which allow you to store and manipulate data in a way that is both fast and intuitive.

In [9]:
# Importing the library
import pandas as pd

## Core Data Structures
### Series
A Series is a one-dimensional array-like object containing a sequence of values (similar to a list in Python) and an associated array of data labels, called its index. Think of it as a single column of a table.

In [10]:
# Creating a Series
s = pd.Series([1, 3, 5, 7, 9])
print(s)

0    1
1    3
2    5
3    7
4    9
dtype: int64
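
The index does not have to be 0, 1, 2, and so on. As a small sketch, the Series below uses names as its index labels, so values can be looked up by label, a little like looking up a row by its key. The variable names are illustrative only.

```python
import pandas as pd

# A Series with custom index labels instead of the default 0..n-1
ages = pd.Series([25, 30, 35], index=["Alice", "Bob", "Charlie"])

bob_age = ages["Bob"]        # look up a value by its label
over_25 = ages[ages > 25]    # keep only the values that pass a condition

print(bob_age)
print(over_25)
```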


### DataFrame
A DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns). It is essentially a spreadsheet or SQL table in Python.

In [11]:
# Creating a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'City': ['New York', 'Paris', 'London']}
df = pd.DataFrame(data)
print(df)

      Name  Age      City
0    Alice   25  New York
1      Bob   30     Paris
2  Charlie   35    London
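
Before working with a DataFrame it usually helps to check its size, column names, and column types, much as you would inspect a table's schema in SQL. The sketch below rebuilds the same small DataFrame so that it runs on its own.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 35],
                   'City': ['New York', 'Paris', 'London']})

print(df.shape)             # (number of rows, number of columns)
print(df.columns.tolist())  # column names as a plain Python list
print(df.dtypes)            # data type of each column
df.info()                   # compact summary: columns, non-null counts, types
```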


### Reading and Writing Data
Pandas supports a variety of file formats, making it easy to read data from and write data to different sources.

**Reading Data**

In [12]:
# Reading a CSV file
# df_csv = pd.read_csv('filename.csv')

# Reading an Excel file
# df_excel = pd.read_excel('filename.xlsx')

# Reading from a SQL database (BigQuery): authenticate to Google Cloud first
from google.colab import auth
auth.authenticate_user()
print('Authenticated')

Authenticated


In [13]:
# Reading the result of a BigQuery SQL query as a DataFrame
# (pandas 2.2 and later deprecate this function in favour of pandas_gbq.read_gbq)
df_sql = pd.read_gbq(
    '''
    SELECT call_date, interaction_id, customer_talk_duration, call_reason_1
    FROM `skyuk-uk-ds-csg-prod.Exec_Dashboard.dashboard_final`
    LIMIT 100
    ''',
    project_id="skyuk-uk-ds-csg-prod",
    dialect='standard')

# Option 1: end the cell with the DataFrame's name to display it as a table
df_sql

Unnamed: 0,call_date,interaction_id,customer_talk_duration,call_reason_1
0,2022-03-20,2730043001,1348,Loyalty Change Upgrades
1,2022-03-20,2730143831,1093,Loyalty Change Upgrades
2,2022-03-20,2730111585,917,Loyalty Change Upgrades
3,2022-03-20,2730005427,1818,Loyalty Change Upgrades
4,2022-03-20,2729955877,189,Loyalty Change Upgrades
...,...,...,...,...
95,2022-03-20,2730064505,525,Classic TV Tech
96,2022-03-20,2730081939,486,Classic TV Tech
97,2022-03-20,2729923789,301,Classic TV Tech
98,2022-03-20,2730137829,374,Classic TV Tech


In [14]:
# Option 2: use print()
print(df_sql)

     call_date interaction_id  customer_talk_duration            call_reason_1
0   2022-03-20     2730043001                    1348  Loyalty Change Upgrades
1   2022-03-20     2730143831                    1093  Loyalty Change Upgrades
2   2022-03-20     2730111585                     917  Loyalty Change Upgrades
3   2022-03-20     2730005427                    1818  Loyalty Change Upgrades
4   2022-03-20     2729955877                     189  Loyalty Change Upgrades
..         ...            ...                     ...                      ...
95  2022-03-20     2730064505                     525          Classic TV Tech
96  2022-03-20     2730081939                     486          Classic TV Tech
97  2022-03-20     2729923789                     301          Classic TV Tech
98  2022-03-20     2730137829                     374          Classic TV Tech
99  2022-03-20     2729975555                     601          Classic TV Tech

[100 rows x 4 columns]


**Writing Data**

In [15]:
# Writing to a CSV file
# df_csv.to_csv('new_filename.csv')

# Writing to an Excel file
# df_excel.to_excel('new_filename.xlsx')

# Writing to a SQL database
# Define the destination table in BigQuery
destination_table = 'CSG_Insight_Team.python_tutorial_testing'
# Push DataFrame to BigQuery
df_sql.to_gbq(destination_table, project_id='skyuk-uk-ds-csg-prod', if_exists='replace')

100%|██████████| 1/1 [00:00<00:00, 4821.04it/s]
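
The CSV calls above are commented out because they need real files. As a self-contained sketch, the example below writes a small DataFrame to an in-memory text buffer instead of a file and reads it back. The buffer and the sample data are assumptions made so the code runs anywhere; with a real file you would pass a path such as `'filename.csv'` instead.

```python
import io
import pandas as pd

df_small = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# Write to an in-memory buffer; index=False leaves out the row labels
buffer = io.StringIO()
df_small.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Read the same text back into a new DataFrame
df_back = pd.read_csv(io.StringIO(csv_text))
print(df_back)
```

Without `index=False`, `to_csv` also writes the row index as an extra first column, which then shows up as an unnamed column when the file is read back.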


## Basic DataFrame Operations
### Indexing, Selection, and Filtering
DataFrames allow for indexing and selection of data in a way that's similar to SQL but more flexible.

**Selecting Columns**

In [16]:
# Select a single column
ages = df['Age']

# Select multiple columns
subset = df[['Name', 'City']]

**Filtering Rows**

In [17]:
# Filter rows where Age is greater than 30
older_than_30 = df[df['Age'] > 30]
print(older_than_30)

      Name  Age    City
2  Charlie   35  London


**Row and Column Selection**

In [18]:
# Select rows labelled 0 to 2 and all columns (.loc slices include the end label)
print(df.loc[0:2, :])

      Name  Age      City
0    Alice   25  New York
1      Bob   30     Paris
2  Charlie   35    London


In [19]:
# Select rows by index and columns by name
print(df.loc[0:2, ['Name', 'Age']])

      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35
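
`.loc` selects by label, while its counterpart `.iloc` selects by integer position and, like ordinary Python slicing, excludes the end of the range. The sketch below contrasts the two on the same small DataFrame.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 35],
                   'City': ['New York', 'Paris', 'London']})

# .loc: labels 0 and 1 are both included
by_label = df.loc[0:1, ['Name', 'Age']]

# .iloc: row position 0 only (position 1 is excluded), first two columns
by_position = df.iloc[0:1, 0:2]

print(by_label)
print(by_position)
```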


# Transitioning from SQL to Pandas

In this section, we'll cover the pandas features that mirror common SQL operations such as filtering, grouping, and joining. Each example shows a SQL query followed by its pandas counterpart, highlighting where the two are alike and where they differ.

In [20]:
# Recall the SQL table we imported earlier

# df_sql = pd.read_gbq(
#    '''
#    SELECT call_date, interaction_id, customer_talk_duration, call_reason_1
#    FROM `skyuk-uk-ds-csg-prod.Exec_Dashboard.dashboard_final`
#    LIMIT 100
#    ''',
#    project_id= "skyuk-uk-ds-csg-prod",
#    dialect='standard')

## Querying and Filtering Data

### Selecting

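Select the columns interaction_id, customer_talk_duration, and call_reason_1, and keep only the rows where customer_talk_duration > 1000. Note that the pandas cell below applies only the row filter, so its output keeps every column of df_sql, including call_date.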

**SQL**

```sql
SELECT interaction_id, customer_talk_duration, call_reason_1
FROM `skyuk-uk-ds-csg-prod.Exec_Dashboard.dashboard_final`
WHERE customer_talk_duration > 1000
```


**Python**

In [21]:
filtered_df = df_sql[df_sql['customer_talk_duration'] > 1000]

# Display the DataFrame
filtered_df

Unnamed: 0,call_date,interaction_id,customer_talk_duration,call_reason_1
0,2022-03-20,2730043001,1348,Loyalty Change Upgrades
1,2022-03-20,2730143831,1093,Loyalty Change Upgrades
3,2022-03-20,2730005427,1818,Loyalty Change Upgrades
5,2022-03-20,2730035423,1524,Loyalty Change Upgrades
6,2022-03-20,2729906813,1287,Loyalty Change Upgrades
7,2022-03-20,2730102181,1115,Loyalty Change Upgrades
9,2022-03-20,2730073375,3689,Loyalty Change Upgrades
10,2022-03-20,2730045985,1044,Loyalty Change Upgrades
15,2022-03-20,2730032689,1579,Loyalty Change Upgrades
19,2022-03-20,2729928307,1158,Loyalty Change Upgrades
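
To match the SQL query exactly, the column list and the row filter can be combined in a single `.loc` call. This is a minimal sketch on a small made-up DataFrame named `calls` that borrows the column names of `df_sql`; the rows and IDs are invented so the example runs without BigQuery access.

```python
import pandas as pd

# Made-up sample rows with the same column names as df_sql
calls = pd.DataFrame({
    'call_date': ['2022-03-20'] * 4,
    'interaction_id': ['1001', '1002', '1003', '1004'],
    'customer_talk_duration': [1348, 917, 1818, 189],
    'call_reason_1': ['Loyalty Change Upgrades', 'Mobile Tech',
                      'Classic TV Tech', 'PPV'],
})

# Row condition first (WHERE), then the list of columns (SELECT)
selected = calls.loc[
    calls['customer_talk_duration'] > 1000,
    ['interaction_id', 'customer_talk_duration', 'call_reason_1'],
]
print(selected)
```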


### Limit

Limit to 5 rows.

**SQL**

```sql
SELECT interaction_id, customer_talk_duration, call_reason_1
FROM `skyuk-uk-ds-csg-prod.Exec_Dashboard.dashboard_final`
LIMIT 5
```


**Python**

In [22]:
limited_df = df_sql.head(5) # use head(5) for the first 5 rows, tail(5) for the last 5 rows

# Display the DataFrame
limited_df

Unnamed: 0,call_date,interaction_id,customer_talk_duration,call_reason_1
0,2022-03-20,2730043001,1348,Loyalty Change Upgrades
1,2022-03-20,2730143831,1093,Loyalty Change Upgrades
2,2022-03-20,2730111585,917,Loyalty Change Upgrades
3,2022-03-20,2730005427,1818,Loyalty Change Upgrades
4,2022-03-20,2729955877,189,Loyalty Change Upgrades


### Distinct

Select distinct call reasons.

**SQL**

```sql
SELECT DISTINCT call_reason_1
FROM `skyuk-uk-ds-csg-prod.Exec_Dashboard.dashboard_final`
```


**Python**

In [23]:
distinct_df = df_sql['call_reason_1'].drop_duplicates()

# Display the result (a Series, because a single column was selected)
distinct_df

0     Loyalty Change Upgrades
54       Welcome Winback Core
55           Classic BBP Tech
58                Welcome RTM
62                Mobile Tech
78                        PPV
92            Classic TV Tech
Name: call_reason_1, dtype: object

## Sorting and Ordering Data
Sort the rows by customer_talk_duration in descending order.

**SQL**

```sql
SELECT interaction_id, customer_talk_duration, call_reason_1
FROM `skyuk-uk-ds-csg-prod.Exec_Dashboard.dashboard_final`
ORDER BY customer_talk_duration DESC
```


**Python**

In [24]:
sorted_df = df_sql.sort_values(by='customer_talk_duration', ascending=False) # use 'ascending=True' if you wish to sort by ascending order

# Display the DataFrame
sorted_df

Unnamed: 0,call_date,interaction_id,customer_talk_duration,call_reason_1
35,2022-03-20,2730129407,3888,Loyalty Change Upgrades
9,2022-03-20,2730073375,3689,Loyalty Change Upgrades
47,2022-03-20,2730006651,3337,Loyalty Change Upgrades
40,2022-03-20,2729996773,3216,Loyalty Change Upgrades
38,2022-03-20,2730038419,3167,Loyalty Change Upgrades
...,...,...,...,...
13,2022-03-20,2730135503,86,Loyalty Change Upgrades
56,2022-03-20,2729911701,78,Classic BBP Tech
71,2022-03-20,2730097079,62,Mobile Tech
61,2022-03-20,2730055475,50,Welcome Winback Core
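
`sort_values` also accepts several columns, each with its own direction, like `ORDER BY a ASC, b DESC`, and chaining `.head(n)` gives the equivalent of `ORDER BY ... LIMIT n`. The sketch below uses a small made-up DataFrame named `calls` with invented values.

```python
import pandas as pd

# Made-up sample rows for illustration
calls = pd.DataFrame({
    'call_reason_1': ['Mobile Tech', 'Classic TV Tech',
                      'Mobile Tech', 'Classic TV Tech'],
    'customer_talk_duration': [300, 525, 917, 1242],
})

# ORDER BY call_reason_1 ASC, customer_talk_duration DESC
ordered = calls.sort_values(
    by=['call_reason_1', 'customer_talk_duration'],
    ascending=[True, False],
)

# ORDER BY customer_talk_duration DESC LIMIT 2
top_2 = calls.sort_values('customer_talk_duration', ascending=False).head(2)

print(ordered)
print(top_2)
```

Sorting keeps the original index labels attached to each row, which is why the index in the output above is no longer in order. Adding `.reset_index(drop=True)` renumbers the rows if that is preferred.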


## Grouping and Aggregating Data
The pandas groupby method mirrors SQL's GROUP BY and supports complex grouping and aggregation operations. The SQL query below computes both aggregates at once; the pandas cells that follow compute them one at a time.

**SQL**

```sql
SELECT call_reason_1, AVG(customer_talk_duration), SUM(customer_talk_duration)
FROM `skyuk-uk-ds-csg-prod.Exec_Dashboard.dashboard_final`
GROUP BY call_reason_1
```


**Python**

In [25]:
# Group by 'call_reason_1' and calculate average talk duration
grouped_df = df_sql.groupby('call_reason_1')['customer_talk_duration'].mean()

# Display the result (a Series indexed by call_reason_1)
grouped_df

call_reason_1
Classic BBP Tech                 841.0
Classic TV Tech                  595.5
Loyalty Change Upgrades    1273.954545
Mobile Tech                 386.666667
PPV                              182.0
Welcome RTM                      288.0
Welcome Winback Core             281.5
Name: customer_talk_duration, dtype: Float64

In [26]:
# Group by 'call_reason_1' and calculate total talk duration
grouped_df = df_sql.groupby('call_reason_1')['customer_talk_duration'].sum()

# Display the result (a Series indexed by call_reason_1)
grouped_df

call_reason_1
Classic BBP Tech            3364
Classic TV Tech             4764
Loyalty Change Upgrades    84081
Mobile Tech                 5800
PPV                          364
Welcome RTM                  864
Welcome Winback Core         563
Name: customer_talk_duration, dtype: Int64
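
Both aggregates can also be computed in one step, which is closer to the single SQL query. The sketch below uses `agg` with named results on a small made-up DataFrame; the output column names `avg_duration` and `total_duration` are illustrative choices, comparable to `AS` aliases in SQL.

```python
import pandas as pd

# Made-up sample rows for illustration
calls = pd.DataFrame({
    'call_reason_1': ['Mobile Tech', 'Classic TV Tech',
                      'Mobile Tech', 'Classic TV Tech'],
    'customer_talk_duration': [300, 525, 917, 1242],
})

# AVG(...) AS avg_duration, SUM(...) AS total_duration, GROUP BY call_reason_1
summary = (
    calls.groupby('call_reason_1')['customer_talk_duration']
    .agg(avg_duration='mean', total_duration='sum')
    .reset_index()   # turn the group labels back into an ordinary column
)
print(summary)
```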

## Merging and Joining Data
Pandas supports various types of joins (inner, outer, left, right) similar to SQL, using the merge function.

**SQL**

Assume two tables (or DataFrames), df1 and df2, that we want to join on the column Key.

```sql
SELECT *
FROM df1
INNER JOIN df2
ON df1.Key = df2.Key
```


**Python**

In [27]:
# Creating sample DataFrames
data1 = {'Key': ['A', 'B', 'C', 'D'],
         'Value1': [1, 2, 3, 4]}
df1 = pd.DataFrame(data1)

data2 = {'Key': ['A', 'B', 'E', 'F'],
         'Value2': ['apple', 'banana', 'orange', 'grape']}
df2 = pd.DataFrame(data2)

# Merging the DataFrames on 'Key' using inner join
merged_df = pd.merge(df1, df2, on='Key', how='inner')

# Display the DataFrame
merged_df

Unnamed: 0,Key,Value1,Value2
0,A,1,apple
1,B,2,banana
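
The other join types only need a different `how` value. As a sketch on the same two sample DataFrames, the example below runs a left join and a full outer join; `indicator=True` adds a `_merge` column showing which side each row came from, which can help when checking a join.

```python
import pandas as pd

df1 = pd.DataFrame({'Key': ['A', 'B', 'C', 'D'],
                    'Value1': [1, 2, 3, 4]})
df2 = pd.DataFrame({'Key': ['A', 'B', 'E', 'F'],
                    'Value2': ['apple', 'banana', 'orange', 'grape']})

# LEFT JOIN: keep every row of df1; unmatched Value2 entries become NaN
left_df = pd.merge(df1, df2, on='Key', how='left')

# FULL OUTER JOIN: keep rows from both sides and record their origin
outer_df = pd.merge(df1, df2, on='Key', how='outer', indicator=True)

print(left_df)
print(outer_df)
```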


## Text Data Manipulation and Regular Expressions

**SQL**

```sql
SELECT interaction_id, customer_talk_duration, call_reason_1
FROM `skyuk-uk-ds-csg-prod.Exec_Dashboard.dashboard_final`
WHERE LOWER(call_reason_1) LIKE '%tv%' AND customer_talk_duration > 1000
```


**Python**

In [28]:
# Filter rows where call_reason_1 contains TV
filtered_tv_df = df_sql[df_sql['call_reason_1'].str.contains('TV', case=False)] # To make the filtering case insensitive, use the case=False argument

# Display the DataFrame
filtered_tv_df

Unnamed: 0,call_date,interaction_id,customer_talk_duration,call_reason_1
92,2022-03-20,2730151969,771,Classic TV Tech
93,2022-03-20,2730039803,1242,Classic TV Tech
94,2022-03-20,2730031353,464,Classic TV Tech
95,2022-03-20,2730064505,525,Classic TV Tech
96,2022-03-20,2730081939,486,Classic TV Tech
97,2022-03-20,2729923789,301,Classic TV Tech
98,2022-03-20,2730137829,374,Classic TV Tech
99,2022-03-20,2729975555,601,Classic TV Tech


In [29]:
# Filter rows where call_reason_1 contains 'TV' and customer_talk_duration is greater than 1000
filtered_tv_duration_df = df_sql[(df_sql['call_reason_1'].str.contains('TV', case=False)) & (df_sql['customer_talk_duration'] > 1000)]

# Display the DataFrame
filtered_tv_duration_df

Unnamed: 0,call_date,interaction_id,customer_talk_duration,call_reason_1
93,2022-03-20,2730039803,1242,Classic TV Tech
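
The `.str` accessor also works with regular expressions. The sketch below uses a small made-up Series of call reasons (values borrowed from the output shown earlier) to match a pattern with `str.contains`, to mimic `LIKE 'Classic%'` with `str.startswith`, and to pull out part of each string with `str.extract`. The patterns themselves are illustrative.

```python
import pandas as pd

reasons = pd.Series(['Classic TV Tech', 'Classic BBP Tech',
                     'Mobile Tech', 'Welcome RTM', 'PPV'])

# Regex: "TV" or "BBP" as a whole word
tv_or_bbp = reasons[reasons.str.contains(r'\b(?:TV|BBP)\b', regex=True)]

# Equivalent of LIKE 'Classic%'
starts_with_classic = reasons[reasons.str.startswith('Classic')]

# Extract the first word of each call reason
first_word = reasons.str.extract(r'^(\w+)', expand=False)

print(tv_or_bbp.tolist())
print(starts_with_classic.tolist())
print(first_word.tolist())
```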


<div class="markdown-google-sans">

# More resources

### Working with notebooks in Colab

</div>

- [Overview of Colaboratory](/notebooks/basic_features_overview.ipynb)
- [Guide to markdown](/notebooks/markdown_guide.ipynb)
- [Importing libraries and installing dependencies](/notebooks/snippets/importing_libraries.ipynb)
- [Saving and loading notebooks in GitHub](https://colab.research.google.com/github/googlecolab/colabtools/blob/main/notebooks/colab-github-demo.ipynb)
- [Interactive forms](/notebooks/forms.ipynb)
- [Interactive widgets](/notebooks/widgets.ipynb)

<div class="markdown-google-sans">

<a name="working-with-data"></a>
### Working with data
</div>

- [Loading data: Drive, Sheets and Google Cloud Storage](/notebooks/io.ipynb)
- [Charts: visualising data](/notebooks/charts.ipynb)
- [Getting started with BigQuery](/notebooks/bigquery.ipynb)

<div class="markdown-google-sans">

### Machine learning crash course

</div>

These are a few of the notebooks from Google's online machine learning course. See the <a href="https://developers.google.com/machine-learning/crash-course/">full course website</a> for more.
- [Intro to Pandas DataFrame](https://colab.research.google.com/github/google/eng-edu/blob/main/ml/cc/exercises/pandas_dataframe_ultraquick_tutorial.ipynb)
- [Linear regression with tf.keras using synthetic data](https://colab.research.google.com/github/google/eng-edu/blob/main/ml/cc/exercises/linear_regression_with_synthetic_data.ipynb)

<div class="markdown-google-sans">

<a name="using-accelerated-hardware"></a>
### Using accelerated hardware
</div>

- [TensorFlow with GPUs](/notebooks/gpu.ipynb)
- [TensorFlow with TPUs](/notebooks/tpu.ipynb)