# First analysis

Download this Jupyter Notebook and solve the tasks by inserting the SQL queries (the tasks cannot be solved in Colab).

Example query (we include `df_example` at the end of the code cell to print the result):

```Python
df_example = pd.read_sql("""
    SELECT *
    FROM ecommerce_data;
""", engine)

df_example
```

## Setup

```Python
import os
import pandas as pd
from sqlalchemy import create_engine
from dotenv import load_dotenv
```

## Data

Connect to your MySQL database "db_ecommerce" (make sure to prepare your `.env` file first).

```Python
load_dotenv()   # take environment variables from .env

engine = create_engine("mysql+pymysql://" + os.environ['DB_URL'] + "/db_ecommerce", pool_pre_ping=True, pool_recycle=300)
```
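
The connection string above implies that `DB_URL` holds the credentials and host in a form such as `user:password@host:port`. That layout is inferred from the string concatenation and is not stated in the notebook, so the following is only a sketch under that assumption. The helper `build_mysql_url` and the placeholder credentials are hypothetical.

```python
def build_mysql_url(env, database="db_ecommerce"):
    """Assemble the SQLAlchemy URL from a mapping such as os.environ."""
    db_url = env.get("DB_URL")
    if not db_url:
        # Fail early with a readable message instead of a KeyError later on
        raise RuntimeError("DB_URL is not set; check your .env file")
    return "mysql+pymysql://" + db_url + "/" + database


# Placeholder credentials for illustration only (hypothetical values)
example_env = {"DB_URL": "user:password@localhost:3306"}
example_url = build_mysql_url(example_env)
print(example_url)
```

In the notebook, calling `build_mysql_url(os.environ)` after `load_dotenv()` would yield the string that is passed to `create_engine`.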

```Python
# Use the pandas to_sql function to create the table in the database.
# The table name must match the one used in the queries (ecommerce_data).
df = pd.read_csv('https://raw.githubusercontent.com/kirenz/lab-competitive/main/code/ecommerce.csv')
df.to_sql('ecommerce_data', engine, if_exists='replace')
```

## Average Revenue by E-Shop

- Show the average revenue for all shops
- Use the alias `average_revenue`

```Python
df_avg_revenue = pd.read_sql("""
    SELECT eshop_name, AVG(annual_revenue) as average_revenue
    FROM ecommerce_data
    GROUP BY eshop_name;
""", engine)

df_avg_revenue
```

| | eshop_name | average_revenue |
|---|---|---|
| 0 | E-ShopA | 54.163333 |
| 1 | E-ShopB | 54.36 |
| 2 | E-ShopC | 47.520556 |
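
To see how `GROUP BY` and `AVG` interact without a MySQL server, the grouping can be tried on a tiny in-memory table. This is a minimal sketch that uses Python's built-in `sqlite3` module instead of MySQL; the table `toy_shops` and its rows are made up for illustration and are not taken from the course data.

```python
import sqlite3

# In-memory SQLite database with a hypothetical toy table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_shops (eshop_name TEXT, annual_revenue REAL)")
conn.executemany(
    "INSERT INTO toy_shops VALUES (?, ?)",
    [("E-ShopA", 10.0), ("E-ShopA", 20.0), ("E-ShopB", 30.0)],
)

# One output row per shop; AVG is computed within each group
rows = conn.execute("""
    SELECT eshop_name, AVG(annual_revenue) AS average_revenue
    FROM toy_shops
    GROUP BY eshop_name
    ORDER BY eshop_name
""").fetchall()
conn.close()

print(rows)  # [('E-ShopA', 15.0), ('E-ShopB', 30.0)]
```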


## E-Shop with the Highest Average Rating

- Only show the E-Shop with the highest average rating
- Use the alias `average_rating`

```Python
df_best_rating = pd.read_sql("""
    SELECT eshop_name, AVG(average_rating) as average_rating
    FROM ecommerce_data
    GROUP BY eshop_name
    ORDER BY average_rating DESC
    LIMIT 1;
""", engine)

df_best_rating
```

| | eshop_name | average_rating |
|---|---|---|
| 0 | E-ShopB | 7.203333 |
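
The "sort descending, keep the first row" pattern used above can be checked on a small example. The sketch below again relies on the built-in `sqlite3` module rather than MySQL; the table `toy_ratings`, its column `rating`, and all values are hypothetical.

```python
import sqlite3

# In-memory SQLite database with made-up ratings
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_ratings (eshop_name TEXT, rating REAL)")
conn.executemany(
    "INSERT INTO toy_ratings VALUES (?, ?)",
    [
        ("E-ShopA", 6.0), ("E-ShopA", 8.0),  # average 7.0
        ("E-ShopB", 9.0), ("E-ShopB", 7.0),  # average 8.0
        ("E-ShopC", 5.0),                    # average 5.0
    ],
)

# Sort the per-shop averages from high to low and keep only the first row
top = conn.execute("""
    SELECT eshop_name, AVG(rating) AS average_rating
    FROM toy_ratings
    GROUP BY eshop_name
    ORDER BY average_rating DESC
    LIMIT 1
""").fetchall()
conn.close()

print(top)  # [('E-ShopB', 8.0)]
```

Note that `LIMIT 1` returns a single row even if two shops share the highest average, so ties are not visible in the result.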


## E-Shop Performance Over Time 

- Show the total revenue per E-Shop and year
- Use the aliases `year` and `total_revenue`

```Python
df_revenue_by_year = pd.read_sql("""
    SELECT eshop_name, YEAR(date) as year, SUM(annual_revenue) as total_revenue
    FROM ecommerce_data
    GROUP BY eshop_name, year;
""", engine)

df_revenue_by_year
```


| | eshop_name | year | total_revenue |
|---|---|---|---|
| 0 | E-ShopA | 2020 | 355.79 |
| 1 | E-ShopA | 2021 | 608.77 |
| 2 | E-ShopA | 2022 | 985.32 |
| 3 | E-ShopB | 2020 | 359.12 |
| 4 | E-ShopB | 2021 | 660.61 |
| 5 | E-ShopB | 2022 | 937.23 |
| 6 | E-ShopC | 2020 | 279.54 |
| 7 | E-ShopC | 2021 | 600.67 |
| 8 | E-ShopC | 2022 | 830.53 |


## Maximum Social Media Followers

- Show the maximum number of social media followers for every E-Shop in descending order
- Use the alias `max_followers`

```Python
df_most_followers = pd.read_sql("""
    SELECT eshop_name, MAX(social_media_followers) as max_followers
    FROM ecommerce_data
    GROUP BY eshop_name
    ORDER BY max_followers DESC;
""", engine)

df_most_followers
```

| | eshop_name | max_followers |
|---|---|---|
| 0 | E-ShopA | 2416.26 |
| 1 | E-ShopC | 2265.09 |
| 2 | E-ShopB | 2253.55 |


## Monthly Time on Site Overview

- Show a monthly overview of the average time on site for every E-shop (order by E-shop and month)
- Use the aliases `month` and `average_time_on_site`
- Hint: when you send the query from Python, you need to write `%%` instead of `%`, because a single `%` is interpreted as a placeholder marker. This means you have to use `DATE_FORMAT(date, '%%m')`
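
The effect of the doubled percent sign can be reproduced in plain Python. This sketch assumes that the driver applies Python's `%`-style string formatting to the query text before sending it to the server, which is how PyMySQL fills in parameters; the variable names are illustrative.

```python
# Query text as it must be written in the Python source: % is doubled
escaped_query = "SELECT DATE_FORMAT(date, '%%m') AS month FROM ecommerce_data"

# %-style formatting with no parameters collapses %% into a single %
rendered_query = escaped_query % ()
print(rendered_query)

# The same query with a single % is rejected by %-style formatting
unescaped_query = "SELECT DATE_FORMAT(date, '%m') AS month FROM ecommerce_data"
try:
    unescaped_query % ()
    single_percent_failed = False
except (TypeError, ValueError):
    single_percent_failed = True
print(single_percent_failed)
```

After formatting, the server receives `'%m'`, which is the format specifier that `DATE_FORMAT` expects.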

```Python
df_user_growth = pd.read_sql("""
    SELECT eshop_name, DATE_FORMAT(date, '%%m') as month, AVG(time_on_site) as average_time_on_site
    FROM ecommerce_data
    GROUP BY eshop_name, month
    ORDER BY eshop_name, month;
""", engine)

df_user_growth
```

| | eshop_name | month | average_time_on_site |
|---|---|---|---|
| 0 | E-ShopA | 1 | 5.266667 |
| 1 | E-ShopA | 2 | 3.813333 |
| 2 | E-ShopA | 3 | 5.186667 |
| 3 | E-ShopA | 4 | 6.67 |
| 4 | E-ShopA | 5 | 5.36 |
| 5 | E-ShopA | 6 | 7.266667 |
| 6 | E-ShopA | 7 | 6.11 |
| 7 | E-ShopA | 8 | 3.676667 |
| 8 | E-ShopA | 9 | 7.25 |
| 9 | E-ShopA | 10 | 7.033333 |


## Close the connection

```Python
# close connection
engine.dispose()
```