---
# Project: Dataset Visualization (Exploration and Understanding)
---

#### In this project, you will create an interactive data visualization application in Dash and use it to explore and understand an underlying Video Game dataset.

#### You will also be following the CRISP-DM process (with its 6 phases) for your project.

**Note! Your final Dash App needs to be a Python application using `.py` files (not a Jupyter Notebook, as in Workshop 4).**

Here's an initial Project Plan (which you can modify within your project group) to give you an overview of the project:

<img src="reports/images/project_plan.drawio.png"></img>

How to use this Notebook:
-  Firstly, make sure you have completed [1. Setup](#1-setup) which covers the Python packages, etc. you will need to use in your project.
-  Secondly, go through [2. Datasets](#2-datasets) which introduces you to the datasets you will be using in the project.
-  Then, go through [3. CRISP-DM](#3-crisp-dm) which covers CRISP-DM Outputs (i.e. documentation) you need to write during the project.
-  Next, go through [4. Tips](#4-tips) which contains some tips for the project.
-  Lastly, read [5. Submission](#5-submission) which contains information about submitting your project.

Project files and folders.
- The file `project.ipynb` is this Notebook containing the project specification.
- The folder `data` contains:
  - 5 comma-separated files (`.csv`) with the data for this project.
  - A subfolder `data_dictionary` which contains:
    - 5 comma-separated files (`.csv`) with descriptions of the columns (features, attributes) in each corresponding dataset above.
- The folder `reports` contains:
  - 6 markdown files (`.md`), one for each CRISP-DM phase, in which you will document each respective CRISP-DM phase for your project.
  - A subfolder `images` housing the Draw.IO file `project_plan.drawio.png` with an initial Project Plan in the form of a [Gantt Chart](https://en.wikipedia.org/wiki/Gantt_chart).

This Notebook contains:

- [1. Setup](#1-setup)
  - [1.1 Python Virtual Environment](#11-python-virtual-environment)
  - [1.2 Python and Pip Version](#12-python-and-pip-version)
  - [1.3 Python Packages](#13-python-packages)
  - [1.4 Python Modules](#14-python-modules)
  - [1.5 Pandas Configuration](#15-pandas-configuration)
  - [1.6 Visual Studio Code Extensions](#16-visual-studio-code-extensions)
- [2. Datasets](#2-datasets)
  - [2.1. The `vg_charts.csv` Dataset](#21-the-vg_charts.csv-dataset)
  - [2.2. The `vg_developers.csv` Dataset](#22-the-vg_developers.csv-dataset)
  - [2.3 The `vg_publishers.csv` Dataset](#23-the-vg_publishers.csv-dataset)
  - [2.4. The `vg_geo_cities.csv` Dataset](#24-the-vg_geo_cities.csv-dataset)
  - [2.5 The `vg_geo_countries.csv` Dataset](#25-the-vg_geo_countries.csv-dataset)
  - [2.6 Additional Datasets](#26-additional-datasets)
- [3. CRISP-DM](#3-crisp-dm)
  - [3.1 Business Understanding](#31-business-understanding)
  - [3.2 Data Understanding](#32-data-understanding)
  - [3.3 Data Preparation](#33-data-preparation)
  - [3.4 Visualization and App Development](#34-visualization-and-app-development)
  - [3.5 Evaluation](#35-evaluation)
  - [3.6 Deployment](#36-deployment)
- [4. Tips](#4-tips)
  - [4.1 Project Structure](#41-project-structure)
  - [4.2 Data Wrangling with Pandas](#42-data-wrangling-with-pandas)
  - [4.3 Numpy and Pandas Sample Code](#43-numpy-and-pandas-sample-code)
  - [4.4 Plotly and Dash Sample Code](#44-plotly-and-dash-sample-code)
- [5. Submission](#5-submission)
  - [5.1 Checklist](#51-checklist)
  - [5.2 Submit the Project](#52-submit-the-project)

---
# 1. Setup
---

The purpose of this section is to make sure you have an initial Python Virtual Environment with necessary Python packages for your project.

---
## 1.1 Python Virtual Environment

- Make sure you have downloaded the GitHub repository (which contains the project specification).
  - **Alternative 1**: You can use the same `git clone` as in the workshops.

    ```bash
    git clone https://github.com/paga-hb/C1VI1B_2025.git dataviz
    cd dataviz
    ```

  - **Alternative 2**: You can `git clone` the repository into another folder just for the project.

    ```bash
    git clone https://github.com/paga-hb/C1VI1B_2025.git dataviz_project
    cd dataviz_project
    ```

- Make sure you have created a Python Virtual Environment as below.

    ```bash
    conda create -y -p ./.conda python=3.12
    conda activate ./.conda
    python -m pip install --upgrade pip
    pip install ipykernel jupyter pylance numpy pandas matplotlib seaborn bokeh plotly basemap pillow
    pip install dash dash-bootstrap-components openpyxl lxml pycountry 
    pip install kaleido --upgrade
    # You can install more packages, but for this project the packages listed above will be useful
    ```

- Also make sure you have selected your Python Virtual Environment in any Notebook you will be using in your project (including this Notebook).
  - In the top right of your Notebook, click `Select Kernel`.
  - Choose your Python Virtual Environment.

---
## 1.2 Python and Pip Version

Let's start by printing out the Python and Pip version you are using in your Python Virtual Environment.

In [96]:
!python --version
!pip --version

Python 3.12.11
pip 25.2 from /home/patrick/projects/dataviz/.conda/lib/python3.12/site-packages/pip (python 3.12)


---
## 1.3 Python Packages

Let's also make sure the necessary Python packages are installed in our Python Virtual Environment.
- I have included some default packages below (many of which might already be installed in your Python Virtual Environment).
- Feel free to add any additional `pip` or `conda` packages.
  - Use `%pip install <package>` for Pip packages.
  - Use `%conda install <package>` for Conda packages.


Note
- When you install a `conda` package, you might have to restart your Jupyter Notebook kernel.
- To restart your Jupyter Notebook kernel, click the `Restart` button at the top of the Notebook.
- You will need the Chrome browser installed if you want to save Plotly plots.
  - Install it manually, or via `!choreo_get_chrome` (if it isn't already installed on your computer).

In [None]:
%pip install ipykernel
%pip install jupyter
%pip install pylance
%pip install numpy
%pip install pandas
%pip install matplotlib
%pip install seaborn
%pip install bokeh
%pip install plotly
%pip install basemap
%pip install pillow
%pip install dash
%pip install dash-bootstrap-components
%pip install openpyxl
%pip install lxml
%pip install pycountry
%pip install kaleido --upgrade
#!choreo_get_chrome

---
## 1.4 Python Modules

The cell below shows how to import some essential Python modules from the installed Python packages for your project.
- Numpy and Pandas are used for data wrangling.
- Matplotlib, Seaborn, Plotly (and Bokeh) are used for visualizations.
- Dash is used to create web-based, interactive data visualization applications (dashboards).
- Feel free to import additional modules in your project.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
import seaborn as sns
import plotly
import plotly.graph_objects as go
from plotly.tools import mpl_to_plotly
import plotly.express as px
import pycountry
from dash import Dash, html, dcc
from dash import dash_table as dtc
from dash.dependencies import Input, Output
from dash.exceptions import PreventUpdate
import dash_bootstrap_components as dbc

---
## 1.5 Pandas Configuration

Let's also configure Pandas to set the maximum number of rows and columns displayed, including the maximum column width, from a `DataFrame` in a Jupyter Notebook cell.
- Configuring the maximum number of rows.
  - `pd.set_option('display.max_rows', None)` displays an unlimited number of rows (don't use this when working with large `DataFrame`s).
  - `pd.set_option('display.max_rows', 25)` displays a maximum of `25` rows.
  - `pd.reset_option('display.max_rows')` resets the number of rows to the default setting.
- Configuring the maximum number of columns.
  - `pd.set_option('display.max_columns', None)` displays an unlimited number of columns.
  - `pd.set_option('display.max_columns', 25)` displays a maximum of `25` columns.
  - `pd.reset_option('display.max_columns')` resets the number of columns to the default setting.
- Configuring the maximum column width.
  - `pd.set_option('display.max_colwidth', None)` displays the full column width without truncation (don't use this when working with very wide columns).
  - `pd.set_option('display.max_colwidth', 25)` sets the maximum column width to `25` (wider columns will be truncated with `...`).
  - `pd.reset_option('display.max_colwidth')` resets the column width to the default setting.

In [75]:
pd.set_option('display.max_rows', None)     # show all rows
pd.set_option('display.max_columns', None)  # show all columns
pd.set_option('display.max_colwidth', None) # show full cell contents without truncating them
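
If you only need the wider display for a single cell, a temporary override may be safer than changing the global settings. The sketch below uses `pd.option_context`, which restores the previous values when the `with` block ends. The `df_demo` frame is made up for illustration and is not one of the project datasets.

```python
import pandas as pd

# Hypothetical demo frame (not one of the project datasets)
df_demo = pd.DataFrame({
    "title": [f"Game {i}" for i in range(100)],
    "total_sales": range(100),
})

rows_before = pd.get_option("display.max_rows")

# Inside the block, printing is limited to 5 rows with untruncated cell contents
with pd.option_context("display.max_rows", 5, "display.max_colwidth", None):
    rows_inside = pd.get_option("display.max_rows")
    print(df_demo)

# After the block, the previous setting is back in place
rows_after = pd.get_option("display.max_rows")
print(rows_before, rows_inside, rows_after)
```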

---
## 1.6 Visual Studio Code Extensions

Let's add some useful Visual Studio Code Extensions.
- [Python](https://marketplace.visualstudio.com/items?itemName=ms-python.python) enables us to work with Python in VSCode (this extension should already be installed).
- [Jupyter](https://marketplace.visualstudio.com/items?itemName=ms-toolsai.jupyter) enables us to work with Jupyter Notebooks in VSCode (this extension should already be installed).
- [Markdown All in One](https://marketplace.visualstudio.com/items?itemName=yzhang.markdown-all-in-one) enables additional support for working with Markdown files in VSCode.
- [Draw.io Integration](https://marketplace.visualstudio.com/items?itemName=hediet.vscode-drawio) enables us to create [Draw.io](https://app.diagrams.net) drawings in VSCode and save them as images.


**Note**
- If you don't want to install these VSCode Extensions manually, you can just run the cell below to install them.
  - The exclamation mark `!` tells Jupyter that we want to execute a shell command in your operating system's shell (e.g. `cmd.exe` or `powershell.exe` on Windows, `zsh` on MacOS, or `bash` on Linux/Ubuntu).
  - The command `code --install-extension <unique identifier> --force` installs an extension.
    - `code` is the command-line VSCode tool.
    - `--install-extension` is the flag we use to install a VSCode Extension.
    - `<unique identifier>` is the unique identifier for the VSCode Extension.
    - `--force` is a flag that will reinstall the VSCode Extension if necessary (it ensures the newest version is installed).
- The unique identifier for a VSCode Extension can be found in the right margin:
  - Under `More Info` on the web-based VSCode Extension Market Place (i.e. by clicking the links above).
  - Under `Installation` in the Extensions View (Windows/Linux: `Ctrl + Shift + X`, MacOS: `Cmd + Shift + X`) in VSCode.

In [295]:
!code --install-extension ms-python.python --force
!code --install-extension ms-toolsai.jupyter --force
!code --install-extension yzhang.markdown-all-in-one --force
!code --install-extension hediet.vscode-drawio --force

Installing extensions...
Extension 'ms-python.python' is already installed.
Installing extensions...
Extension 'ms-toolsai.jupyter' is already installed.
Installing extensions...
Extension 'yzhang.markdown-all-in-one' is already installed.
Installing extensions...
Extension 'hediet.vscode-drawio' is already installed.


---
# 2. Datasets
---

The purpose of this section is to introduce you to the datasets for this project, which are located in the `data` folder.
- Some of the datasets are not entirely clean.
- You will need to load, examine, and pre-process (clean, transform, merge, etc.) the datasets for the task at hand.

The *data dictionaries* for the datasets are located in the folder `data/data_dictionary` (with the same file names as the datasets).
- The *data dictionary* file for each dataset contains the column names (attributes, features) together with a description of each column (attribute, feature).

Let's load and print each data dictionary together with the first 5 rows in each corresponding dataset.

---
## 2.1. The `vg_charts.csv` Dataset

The `vg_charts.csv` dataset is the main dataset, containing information about video game titles, platform (e.g. PS2, XBOX360, PC), genre (e.g. Action, Shooter, Role-Playing), publisher, developer, critic score, regional sales (e.g. North America, Japan, Europe and Africa), release date, etc.

In [274]:
# Load and display the data dictionary for the vg_charts dataset
df_charts_dictionary = pd.read_csv('data/data_dictionary/vg_charts.csv')
df_charts_dictionary

Unnamed: 0,Field,Description
0,img,URL slug for the box art at vgchartz.com
1,title,Game title
2,platform,Platform the game was released for
3,genre,Genre of the game
4,publisher,Publisher of the game
5,developer,Developer of the game
6,critic_score,Metacritic score (out of 10)
7,total_sales,Global sales of copies in millions
8,na_sales,North American sales of copies in millions
9,jp_sales,Japanese sales of copies in millions


In [276]:
# Load the vg_charts dataset and display the first 5 rows
df_charts = pd.read_csv('data/vg_charts.csv')
df_charts.head()

Unnamed: 0,img,title,platform,genre,publisher,developer,critic_score,total_sales,na_sales,jp_sales,pal_sales,other_sales,release_date,last_update
0,/games/boxart/full_6510540AmericaFrontccc.jpg,Grand Theft Auto V,PS3,Action,Rockstar Games,Rockstar North,9.4,20.32,6.37,0.99,9.85,3.12,2013-09-17,
1,/games/boxart/full_5563178AmericaFrontccc.jpg,Grand Theft Auto V,PS4,Action,Rockstar Games,Rockstar North,9.7,19.39,6.06,0.6,9.71,3.02,2014-11-18,2018-01-03
2,/games/boxart/827563ccc.jpg,Grand Theft Auto: Vice City,PS2,Action,Rockstar Games,Rockstar North,9.6,16.15,8.41,0.47,5.49,1.78,2002-10-28,
3,/games/boxart/full_9218923AmericaFrontccc.jpg,Grand Theft Auto V,X360,Action,Rockstar Games,Rockstar North,,15.86,9.06,0.06,5.33,1.42,2013-09-17,
4,/games/boxart/full_4990510AmericaFrontccc.jpg,Call of Duty: Black Ops 3,PS4,Shooter,Activision,Treyarch,8.1,15.09,6.18,0.41,6.05,2.44,2015-11-06,2018-01-14
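
The `release_date` and `last_update` columns are loaded as text, so a date conversion is probably one of the first pre-processing steps. The sketch below assumes the `YYYY-MM-DD` format seen in the preview above. Two of the sample rows are copied from that preview, while the third row (with an unparseable date) is hypothetical and only there to show what `errors="coerce"` does.

```python
import pandas as pd

# Two rows from the preview above plus one hypothetical row with a bad date
df_sample = pd.DataFrame({
    "title": ["Grand Theft Auto V", "Grand Theft Auto: Vice City", "Hypothetical Game"],
    "release_date": ["2013-09-17", "2002-10-28", "unknown"],
    "last_update": [None, None, "2018-01-03"],
})

# Convert both text columns to datetimes; unparseable values become NaT
for col in ["release_date", "last_update"]:
    df_sample[col] = pd.to_datetime(df_sample[col], format="%Y-%m-%d", errors="coerce")

# Derive a release year; the nullable Int64 dtype keeps missing years as <NA>
df_sample["release_year"] = df_sample["release_date"].dt.year.astype("Int64")

print(df_sample.dtypes)
print(df_sample)
```

A derived column such as `release_year` is often easier to use for grouping and for slider components than the full date.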


Let's also print out the number of rows, and for each feature (column), its data type, the number of missing values, and the number of unique values.
- Another option is to use `df_charts.info()` to show column names, number of non-missing values, and data types.

In [279]:
print(f'Number of rows: {len(df_charts)}')
df_temp = df_charts.dtypes.to_frame().reset_index().rename(columns={0: "Dtype", "index": "Column"})
df_temp["Missing Value Count"] = df_charts.isna().sum().values
df_temp["Unique Value Count"] = df_charts.nunique().values
df_temp

Number of rows: 64016


Unnamed: 0,Column,Dtype,Missing Value Count,Unique Value Count
0,img,object,0,56177
1,title,object,0,39798
2,platform,object,0,81
3,genre,object,0,20
4,publisher,object,0,3383
5,developer,object,17,8862
6,critic_score,float64,57338,89
7,total_sales,float64,45094,482
8,na_sales,float64,51379,320
9,jp_sales,float64,57290,121
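
The same summary is computed for every dataset in this section, so it may be convenient to wrap it in a small helper. The sketch below is one way to do that; the function name `summarize` and the `df_demo` frame are made up for illustration.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype, missing count and unique count."""
    return pd.DataFrame({
        "Column": df.columns,
        "Dtype": df.dtypes.astype(str).values,
        "Missing Value Count": df.isna().sum().values,
        "Unique Value Count": df.nunique().values,  # missing values are not counted
    })


# Hypothetical demo frame (not one of the project datasets)
df_demo = pd.DataFrame({
    "genre": ["Action", "Action", "Shooter"],
    "critic_score": [9.4, None, 8.1],
})

print(f"Number of rows: {len(df_demo)}")
summary = summarize(df_demo)
print(summary)
```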


Notice the column `img` that contains *URL slugs for the box art* at [https://www.vgchartz.com](https://www.vgchartz.com)
- If you want to use box art for various game titles in your Dash application, you can use the code below to download the box art.
- It places the box art images in the folder `data/boxart_images`.

**Note**
- The code below only downloads box art for the first 10 game titles.
- I wouldn't recommend downloading all box art at once, since it will take quite some time, and will require a lot of disk space.

In [282]:
import pandas as pd
import requests
import os

# Let's download the boxart for the first 10 game titles
# NOTE! I wouldn't recommend downloading all box art at once,
# since it will take quite some time, and will require a lot of disk space.
df = df_charts.head(10)

# Base URL for VGChartz images
base_url = "https://www.vgchartz.com"  # we will prepend this to each slug

# Create a folder to save box art images
os.makedirs("data/boxart_images", exist_ok=True)

# Loop over the dataset
for index, row in df.iterrows():
    slug = row['img']  # e.g., "/games/boxart/full_6510540AmericaFrontccc.jpg"
    if pd.isna(slug) or slug.strip() == "":
        continue  # skip missing URLs
    
    url = base_url + slug
    filename = os.path.join("data/boxart_images", os.path.basename(slug))
    
    try:
        # Download the image (time out after 10 seconds instead of hanging)
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # raise error if download fails
        
        # Save to file
        with open(filename, 'wb') as f:
            f.write(response.content)
        
        print(f"Downloaded: {filename}")
    except requests.RequestException as e:
        # Covers HTTP errors as well as timeouts and connection failures
        print(f"Failed to download {url}: {e}")

Downloaded: data/boxart_images/full_6510540AmericaFrontccc.jpg
Downloaded: data/boxart_images/full_5563178AmericaFrontccc.jpg
Downloaded: data/boxart_images/827563ccc.jpg
Downloaded: data/boxart_images/full_9218923AmericaFrontccc.jpg
Downloaded: data/boxart_images/full_4990510AmericaFrontccc.jpg
Downloaded: data/boxart_images/full_call-of-duty-modern-warfare-3_517AmericaFront.jpg
Downloaded: data/boxart_images/full_call-of-duty-black-ops_5AmericaFront.jpg
Downloaded: data/boxart_images/full_4653215AmericaFrontccc.jpg
Downloaded: data/boxart_images/full_1977964AmericaFrontccc.jpg
Downloaded: data/boxart_images/full_4649679AmericaFrontccc.png
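
As an alternative to downloading files, you could build the full image URL from the slug and pass it to an image component (for example the `src` of `html.Img` in Dash). The sketch below only builds the URL; the helper name `boxart_url` is hypothetical, and whether the site allows its images to be embedded this way is an assumption you would need to check.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.vgchartz.com"


def boxart_url(slug):
    """Return the full box art URL for a slug, or None if the slug is missing."""
    # Missing values from pandas arrive as NaN (a float), so only accept strings
    if not isinstance(slug, str) or slug.strip() == "":
        return None
    return urljoin(BASE_URL, slug.strip())


print(boxart_url("/games/boxart/827563ccc.jpg"))
print(boxart_url(float("nan")))
```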


---
## 2.2. The `vg_developers.csv` Dataset

The `vg_developers.csv` dataset contains information about game developers, including their headquarters (city and country).

In [283]:
# Load and display the data dictionary for the vg_developers dataset
df_developers_dictionary = pd.read_csv('data/data_dictionary/vg_developers.csv')
df_developers_dictionary

Unnamed: 0,Field,Description
0,developer,Developer of the game
1,city,Developer city headquarters
2,country,Developer country headquarters


In [284]:
# Load the vg_developers dataset and display the first 5 rows
df_developers = pd.read_csv('data/vg_developers.csv')
df_developers.head()

Unnamed: 0,developer,city,country
0,0verflow,Tokyo,Japan
1,11 bit studios,Warsaw,Poland
2,1C Company,Moscow,Russia
3,1-Up Studio,Tokyo,Japan
4,2K Czech,Brno,Czech Republic


Let's also print out the number of rows, and for each feature (column), its data type, the number of missing values, and the number of unique values.

In [285]:
print(f'Number of rows: {len(df_developers)}')
df_temp = df_developers.dtypes.to_frame().reset_index().rename(columns={0: "Dtype", "index": "Column"})
df_temp["Missing Value Count"] = df_developers.isna().sum().values
df_temp["Unique Value Count"] = df_developers.nunique().values
df_temp

Number of rows: 731


Unnamed: 0,Column,Dtype,Missing Value Count,Unique Value Count
0,developer,object,0,731
1,city,object,5,309
2,country,object,0,51
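
This dataset has 731 developers, while `df_charts` lists 8862 unique developer names, so a join will likely leave many chart rows without a headquarters. The sketch below shows one way to merge the two and count the unmatched rows. The sample frames are hypothetical (the city and country values are illustrative, not taken from `vg_developers.csv`), and the renamed columns `developer_city` and `developer_country` are my own naming choice.

```python
import pandas as pd

# Hypothetical samples standing in for df_charts and df_developers
df_charts_sample = pd.DataFrame({
    "title": ["Grand Theft Auto V", "Call of Duty: Black Ops 3", "Hypothetical Game"],
    "developer": ["Rockstar North", "Treyarch", "Unknown Studio"],
})
df_developers_sample = pd.DataFrame({
    "developer": ["Rockstar North", "Treyarch"],
    "city": ["Edinburgh", "Santa Monica"],         # illustrative values
    "country": ["United Kingdom", "United States"],  # illustrative values
})

# Rename first so the columns do not clash with publisher city/country later
df_dev = df_developers_sample.rename(
    columns={"city": "developer_city", "country": "developer_country"}
)

df_merged = df_charts_sample.merge(
    df_dev,
    on="developer",
    how="left",              # keep every chart row, matched or not
    validate="many_to_one",  # raise an error if a developer appears twice in df_dev
    indicator=True,          # adds a _merge column: "both" or "left_only"
)

print(df_merged["_merge"].value_counts())
print(df_merged)
```

Rows marked `left_only` are the games whose developer was not found, which is worth reporting in your Data Quality Report.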


---
## 2.3 The `vg_publishers.csv` Dataset

The `vg_publishers.csv` dataset contains information about game publishers, including their headquarters (city and country).

In [286]:
# Load and display the data dictionary for the vg_publishers dataset
df_publishers_dictionary = pd.read_csv('data/data_dictionary/vg_publishers.csv')
df_publishers_dictionary

Unnamed: 0,Field,Description
0,publisher,Publisher of the game
1,city,Publisher city headquarters
2,country,Publisher country headquarters


In [287]:
# Load the vg_publishers dataset and display the first 5 rows
df_publishers = pd.read_csv('data/vg_publishers.csv')
df_publishers.head()

Unnamed: 0,publisher,city,country
0,07th Expansion,,Japan
1,11 bit studios,Warsaw,Poland
2,1C Company,Moscow,Russia
3,20th Century Games,"Century City, California",United States
4,2K Games,"Novato, California",United States


Let's also print out the number of rows, and for each feature (column), its data type, the number of missing values, and the number of unique values.

In [288]:
print(f'Number of rows: {len(df_publishers)}')
df_temp = df_publishers.dtypes.to_frame().reset_index().rename(columns={0: "Dtype", "index": "Column"})
df_temp["Missing Value Count"] = df_publishers.isna().sum().values
df_temp["Unique Value Count"] = df_publishers.nunique().values
df_temp

Number of rows: 889


Unnamed: 0,Column,Dtype,Missing Value Count,Unique Value Count
0,publisher,object,0,888
1,city,object,147,368
2,country,object,1,50
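
The summary shows 889 rows but only 888 unique publisher names, which suggests that one name appears twice. The sketch below shows how such rows could be found and one possible way to resolve them. The sample frame is hypothetical, and preferring the row with a known city is only an assumption about what a sensible rule might be.

```python
import pandas as pd

# Hypothetical sample containing one repeated publisher name
df_pub = pd.DataFrame({
    "publisher": ["2K Games", "Hypothetical Publisher", "Hypothetical Publisher"],
    "city": ["Novato, California", None, "Tokyo"],
    "country": ["United States", "Japan", "Japan"],
})

# keep=False marks every occurrence of a duplicated name, not just the repeats
dupes = df_pub[df_pub.duplicated(subset="publisher", keep=False)]
print(dupes)

# One possible resolution: sort rows with a known city first, then keep the first row per name
df_pub_unique = (
    df_pub.sort_values("city", na_position="last")
          .drop_duplicates(subset="publisher", keep="first")
          .sort_values("publisher")
          .reset_index(drop=True)
)
print(df_pub_unique)
```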


---
## 2.4. The `vg_geo_cities.csv` Dataset

The `vg_geo_cities.csv` dataset contains information about cities (city names), including each city's geographical coordinates (latitude and longitude).

In [289]:
# Load and display the data dictionary for the vg_geo_cities dataset
df_geo_cities_dictionary = pd.read_csv('data/data_dictionary/vg_geo_cities.csv')
df_geo_cities_dictionary

Unnamed: 0,Field,Description
0,city,City name
1,latitude,City center latitude
2,longitude,City center longitude


In [290]:
# Load the vg_geo_cities dataset and display the first 5 rows
df_geo_cities = pd.read_csv('data/vg_geo_cities.csv')
df_geo_cities.head()

Unnamed: 0,city,latitude,longitude
0,Shanghai,31.22222,121.45806
1,Buenos Aires,-34.61315,-58.37723
2,Mumbai,19.07283,72.88261
3,Mexico City,19.42847,-99.12766
4,Beijing,39.9075,116.39723


Let's also print out the number of rows, and for each feature (column), its data type, the number of missing values, and the number of unique values.

In [291]:
print(f'Number of rows: {len(df_geo_cities)}')
df_temp = df_geo_cities.dtypes.to_frame().reset_index().rename(columns={0: "Dtype", "index": "Column"})
df_temp["Missing Value Count"] = df_geo_cities.isna().sum().values
df_temp["Unique Value Count"] = df_geo_cities.nunique().values
df_temp

Number of rows: 23412


Unnamed: 0,Column,Dtype,Missing Value Count,Unique Value Count
0,city,object,0,22244
1,latitude,float64,0,22617
2,longitude,float64,0,22829
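
Two things may get in the way when joining this dataset to the publisher or developer cities. First, the summary shows 23412 rows but only 22244 unique city names, so some names occur more than once. Second, the publisher preview in section 2.3 stores some cities with a state, e.g. `Novato, California`. The sketch below handles both in a simple way. The coordinates are hypothetical, the column name `city_name` is my own, and keeping the first row per city name is an assumption that may pick the wrong city.

```python
import pandas as pd

# Publisher rows as in the preview in section 2.3
df_pub = pd.DataFrame({
    "publisher": ["11 bit studios", "2K Games"],
    "city": ["Warsaw", "Novato, California"],
})

# Hypothetical coordinates, with one city name appearing twice
df_geo = pd.DataFrame({
    "city": ["Warsaw", "Novato", "Warsaw"],
    "latitude": [52.23, 38.11, 41.24],
    "longitude": [21.01, -122.57, -85.85],
})

# Keep only the part before the first comma, e.g. "Novato, California" -> "Novato"
df_pub["city_name"] = df_pub["city"].str.split(",").str[0].str.strip()

# Assumption: the first row per city name is the intended city
df_geo_unique = df_geo.drop_duplicates(subset="city", keep="first")

df_located = df_pub.merge(
    df_geo_unique,
    left_on="city_name",
    right_on="city",
    how="left",
    suffixes=("", "_geo"),   # the geo city column becomes city_geo
    validate="many_to_one",  # guards against duplicated city names on the right
)
print(df_located)
```

Without the `drop_duplicates` step, every publisher in a repeated city name would appear once per match, which silently inflates row counts.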


---
## 2.5 The `vg_geo_countries.csv` Dataset

The `vg_geo_countries.csv` dataset contains information about countries (country names), including each country's 2-letter (ISO2) and 3-letter (ISO3) country code, and each country's geographical coordinates (latitude and longitude).

In [292]:
# Load and display the data dictionary for the vg_geo_countries dataset
df_geo_countries_dictionary = pd.read_csv('data/data_dictionary/vg_geo_countries.csv')
df_geo_countries_dictionary

Unnamed: 0,Field,Description
0,Country,Country name
1,Alpha-2 code,ISO2 country code
2,Alpha-3 code,ISO3 country code
3,Latitude,Country center latitude
4,Longitude,Country center longitude


In [293]:
# Load the vg_geo_countries dataset and display the first 5 rows
df_geo_countries = pd.read_csv('data/vg_geo_countries.csv')
df_geo_countries.head()

Unnamed: 0,Country,Alpha-2 code,Alpha-3 code,Latitude,Longitude
0,Afghanistan,"""AF""","""AFG""","""33""","""65"""
1,Åland Islands,"""AX""","""ALA""","""60.116667""","""19.9"""
2,Albania,"""AL""","""ALB""","""41""","""20"""
3,Algeria,"""DZ""","""DZA""","""28""","""3"""
4,American Samoa,"""AS""","""ASM""","""-14.3333""","""-170"""


Let's also print out the number of rows, and for each feature (column), its data type, the number of missing values, and the number of unique values.

In [294]:
print(f'Number of rows: {len(df_geo_countries)}')
df_temp = df_geo_countries.dtypes.to_frame().reset_index().rename(columns={0: "Dtype", "index": "Column"})
df_temp["Missing Value Count"] = df_geo_countries.isna().sum().values
df_temp["Unique Value Count"] = df_geo_countries.nunique().values
df_temp

Number of rows: 262


Unnamed: 0,Column,Dtype,Missing Value Count,Unique Value Count
0,Country,object,0,262
1,Alpha-2 code,object,0,251
2,Alpha-3 code,object,0,251
3,Latitude,object,0,183
4,Longitude,object,0,202
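
The preview and summary above show that the codes and coordinates contain literal double quotes and that `Latitude` and `Longitude` are stored as `object` rather than numbers. The sketch below is one way to clean this up. The sample mimics the preview, but the leading space before each quote is an assumption about the raw file, and the new column names are my own choice.

```python
import pandas as pd

# Sample mimicking the preview above; the leading space is an assumption
df_geo_countries_sample = pd.DataFrame({
    "Country": ["Afghanistan", "Albania"],
    "Alpha-2 code": [' "AF"', ' "AL"'],
    "Alpha-3 code": [' "AFG"', ' "ALB"'],
    "Latitude": [' "33"', ' "41"'],
    "Longitude": [' "65"', ' "20"'],
})

df_clean = df_geo_countries_sample.copy()

# Remove surrounding whitespace first, then the literal double quotes
for col in ["Alpha-2 code", "Alpha-3 code", "Latitude", "Longitude"]:
    df_clean[col] = df_clean[col].str.strip().str.strip('"')

# Convert the coordinates to numbers; anything unparseable becomes NaN
for col in ["Latitude", "Longitude"]:
    df_clean[col] = pd.to_numeric(df_clean[col], errors="coerce")

# Optional: shorter, lowercase column names (hypothetical naming)
df_clean.columns = ["country", "iso2", "iso3", "latitude", "longitude"]

print(df_clean.dtypes)
print(df_clean)
```

If the quotes turn out to come from a space after each comma in the raw file, passing `skipinitialspace=True` to `pd.read_csv` may avoid the problem at load time.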


---
## 2.6 Additional Datasets

Feel free to use any additional (appropriate) datasets for your project.
- Maybe you want to search for and include game reviews, etc.?
- If so, don't forget to add them to your CRISP-DM documentation.

---
# 3. CRISP-DM
---

The purpose of this section is to introduce you to the CRISP-DM reports in the `reports` folder.

You will be following a modified CRISP-DM process for the project.
- As you know, the CRISP-DM process contains 6 Phases.
  - Business Understanding
  - Data Understanding
  - Data Preparation
  - Modeling
  - Evaluation
  - Deployment
- Since this project doesn't contain any traditional data mining or machine learning, I have renamed the `Modeling` Phase to `Visualization and App Development`.
  - Business Understanding
  - Data Understanding
  - Data Preparation
  - `Visualization and App Development`
  - Evaluation
  - Deployment
- As you also know, each Phase contains a number of `Generic Tasks`.
  - Usually these are adapted to `Specific Tasks` for a project.
  - We will stick with the `Generic Tasks` for most project Phases, but I have modified a few to make them more appropriate for this project.
    - All the `Generic Tasks` are included, unaltered, for the following Phases.
      - Business Understanding
      - Data Understanding
      - Data Preparation
    - The `Visualization and App Development` Phase contains the following Tasks.
      - Select Visualization Technique
      - Generate App Design
      - Build App
      - Assess App
    - The `Evaluation` Phase contains the following Tasks.
      - Evaluate Results
      - Review Process
      - Determine Next Steps
    - The `Deployment` Phase contains the following Tasks.
      - Produce Final Report
      - Review Project
- Finally, as you also know, each Task contains a number of `Outputs`, often documentation in the form of reports.
  - To make this a bit simpler for this project, the `reports` folder contains one Markdown file (`.md`) for each Phase.
  - Each Markdown file (`.md`) contains the name of the `Phase`, its `Tasks`, and its `Outputs`.
  - For each `Output`, I have written instructions for what you need to document. I have already completed most `Outputs` in the Business Understanding report.

Next, let's quickly skim through the various reports you need to complete for the project.

---
## 3.1 Business Understanding

- Right-click the file `1_BusinessUnderstanding.md` in the `reports` folder, and choose `Open to the Side` from the pop-up context menu.
  - This opens the raw Markdown file in the Code Editor in edit mode, side-by-side with this Jupyter Notebook.
- Click the raw Markdown file in the Code Editor to make sure it is selected.
  - Notice the icons in the top-right of the Markdown file.
- Click the left-most icon in the top-right of the raw Markdown file in the Code Editor to get a side-by-side preview of the rendered Markdown file.
  - If you hover the mouse pointer over the icon, the tool-tip should read something like *Open Preview to the Side*.
- Use the mouse to scroll in the raw Markdown file, and also try changing some text.
  - Notice that the changes are immediately shown in the rendered preview of the Markdown file.
- Close the raw Markdown file.
  - Leave the rendered preview Markdown file open in the Code Editor.
- Let's examine the Markdown file.
  - At the top of the Markdown file, we see `Phase: Business Understanding`, indicating this report is for the Business Understanding Phase.
  - Below this, we see a number of Tasks, e.g. `Task: Determine Business Objectives`.
  - Below each Task, we see a number of Outputs, e.g. `Output: Background`.
  - I have already completed most Outputs in this report, but you need to:
    - Add more appropriate questions under `Output: Business Objectives` (optional).
    - Add more terms and descriptions under `Output: Terminology` (optional).
    - Add more appropriate questions under `Output: Data Mining Goals` (optional).
    - Update the Project Plan under `Output: Project Plan` (mandatory).

- So, the only Output you need to update in this report is really `Output: Project Plan`.
  - Discuss within your project group how the team wants to plan the project.
  - Then update the Project Plan accordingly.

- Project Plan
  - In the Markdown file, I describe how I created the sample Project Plan (using Draw.io).
    - The Project Plan was created as a `.png` image in the file `project_plan.drawio.png` in the `reports/images` folder.
    - Then the image was included in `1_BusinessUnderstanding.md` using the HTML tag `img`.
    - Note, if you modify `project_plan.drawio.png`, the changes will automatically be shown in `1_BusinessUnderstanding.md`.
  - The sample Project Plan contains the Phases and their respective Tasks.
  - Feel free to modify the sample Project Plan via Draw.io, or to use any project planning tool (e.g. MS Project or Excel).
    - If you use an external project planning tool, you need to add the file to your project before submitting your project via Canvas.

---
## 3.2 Data Understanding

- Open the Markdown file `2_DataUnderstanding.md`.
  - Right-click the file `2_DataUnderstanding.md` in the `reports` folder, and choose `Open to the Side` from the pop-up context menu.
  - Then click the left-most icon in the top-right of the raw Markdown file in the Code Editor to get a side-by-side preview of the rendered Markdown file.
  - Close the raw Markdown file.
- Let's examine the Markdown file.
  - I have already completed `Output: Initial Data Collection Report`.
    - Feel free to update it if you collect additional datasets.
  - You need to complete the following Outputs.
    - `Output: Data Description Report`
    - `Output: Data Exploration Report`
    - `Output: Data Quality Report`

---
## 3.3 Data Preparation

- Open the Markdown file `3_DataPreparation.md`.
  - Right-click the file `3_DataPreparation.md` in the `reports` folder, and choose `Open to the Side` from the pop-up context menu.
  - Then click the left-most icon in the top-right of the raw Markdown file in the Code Editor to get a side-by-side preview of the rendered Markdown file.
  - Close the raw Markdown file.
- Let's examine the Markdown file.
  - Here you need to complete all Outputs.
    - `Output: Rationale for Inclusion/Exclusion`
    - `Output: Data Cleaning Report`
    - `Output: Derived Attributes`
    - `Output: Generated Records`
    - `Output: Merged Data`
    - `Output: Reformatted Data`
    - `Output: Datasets`
    - `Output: Dataset Descriptions`

---
## 3.4 Visualization and App Development

- Open the Markdown file `4_VisualizationAndAppDevelopment.md`.
  - Right-click the file `4_VisualizationAndAppDevelopment.md` in the `reports` folder, and choose `Open to the Side` from the pop-up context menu.
  - Then click the left-most icon in the top-right of the raw Markdown file in the Code Editor to get a side-by-side preview of the rendered Markdown file.
  - Close the raw Markdown file.
- Let's examine the Markdown file.
  - Here you need to complete all Outputs.
    - `Output: Visualization Techniques`
    - `Output: App Design`
    - `Output: App Description`
    - `Output: App Assessment`

---
## 3.5 Evaluation

- Open the Markdown file `5_Evaluation.md`.
  - Right-click the file `5_Evaluation.md` in the `reports` folder, and choose `Open to the Side` from the pop-up context menu.
  - Then click the left-most icon in the top-right of the raw Markdown file in the Code Editor to get a side-by-side preview of the rendered Markdown file.
  - Close the raw Markdown file.
- Let's examine the Markdown file.
  - Here you need to complete the following Outputs.
    - `Output: Assessment of Data Mining Results`
    - `Output: Approved Visualizations and App Functionality`
    - `Output: Review of Process`
  - Note, we will skip the following Outputs in this project.
    - `Output: List of Possible Actions`
    - `Output: Decisions`

---
## 3.6 Deployment

- Open the Markdown file `6_Deployment.md`.
  - Right-click the file `6_Deployment.md` in the `reports` folder, and choose `Open to the Side` from the pop-up context menu.
  - Then click the left-most icon in the top-right of the raw Markdown file in the Code Editor to get a side-by-side preview of the rendered Markdown file.
  - Close the raw Markdown file.
- Let's examine the Markdown file.
  - Here you need to complete the following Outputs in the Markdown file.
    - `Output: Final Report`
    - `Output: Experience Documentation`
  - The following Outputs also need to be completed, but not in the Markdown file.
    - `Output: Final Presentation`
      - This is the presentation you need to create and present at the seminar. 
      - The presentation also has to be submitted separately via Canvas.
    - `Output: App Demo`
      - This is the Dash application you need to prepare and demonstrate (demo) at the seminar as part of your presentation.

**Note**
- The Notebook `presentation.ipynb` in the `presentation` folder contains more information about the presentation and the app demo.

---
# 4. Tips
---

This section contains some tips for your project.

---
## 4.1 Project Structure

Currently, you have the complete GitHub repository for this course loaded into VSCode with the `dataviz` folder as your Workspace Folder.

Feel free to create a separate folder for the project, e.g.:

    ```python
    ├── project
      ├── data
        ├── data_dictionary
          ├── vg_charts.csv
          ├── vg_developers.csv
          ├── vg_geo_cities.csv
          ├── vg_geo_countries.csv
          └── vg_publishers.csv
        ├── vg_charts.csv
        ├── vg_developers.csv
        ├── vg_geo_cities.csv
        ├── vg_geo_countries.csv
        └── vg_publishers.csv
      ├── reports
        ├── 1_BusinessUnderstanding.md
        ├── 2_DataUnderstanding.md
        ├── 3_DataPreparation.md
        ├── 4_VisualizationAndAppDevelopment.md
        ├── 5_Evaluation.md
        └── 6_Deployment.md
      └── project.ipynb
    ```

Then, under the root `project` folder, you can add any `.py` files and/or `.ipynb` files and subfolders you create during the project, e.g.:

    ```python
    ├── project
      ├── .conda
      ├── main.py
      ├── utility
        ├── __init__.py
        └── preprocessing.py
      ├── notebooks
        └── preprocessing.ipynb
      ├── .gitignore
      ├── LICENSE
      └── README.md
    ```

Here I've added:
- A `.conda` Python Virtual Environment (basically, I've copied the existing `.conda` folder described in [1. Setup](#1-setup) into the `project` folder).
- A `main.py` file for the entry point into the Dash application (i.e. it includes `if __name__ == "__main__":`).
- A `utility` folder with:
  - An `__init__.py` file, which makes the `utility` folder a Python package called `utility`.
  - A `preprocessing.py` file, which in this context is a `preprocessing` module under the `utility` package.
    - This file might contain Python functions for loading the `.csv` files as Pandas DataFrames, preprocessing them, and saving them as Pickle (`.pkl`) files.
    - The various functions can be imported into `main.py` as e.g. `from utility.preprocessing import load, preprocess, save`.
- A `notebooks` folder with:
  - A `preprocessing.ipynb` file, which does all loading, preprocessing, and saving in a Jupyter Notebook instead of in a Python package/module as above.
- A `.gitignore` file for excluding various files and folders from Git SCM (source control management), if you are using Git SCM in your project.
- A `LICENSE` file (you can choose any license file, or skip adding one).
- A `README.md` file with instructions for installing the Python Virtual Environment and Python packages, how to use the Dash application, and how to use any Notebooks.
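
As a minimal sketch of what `utility/preprocessing.py` could contain, the block below defines the three functions named above and runs a throwaway round trip in a temporary folder. The function signatures, the `text_columns` parameter, and the sample file and column names are assumptions made for illustration; adapt them to the actual datasets.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load(csv_path):
    """Read one .csv file into a DataFrame."""
    return pd.read_csv(csv_path)


def preprocess(df, text_columns=()):
    """Strip whitespace in the given text columns and drop duplicate rows."""
    df = df.copy()
    for col in text_columns:
        df[col] = df[col].str.strip()
    return df.drop_duplicates().reset_index(drop=True)


def save(df, pkl_path):
    """Write the DataFrame to a Pickle (.pkl) file."""
    df.to_pickle(pkl_path)


# Throwaway round trip in a temporary folder (file and column names are made up).
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "sample.csv"
    pkl_path = Path(tmp) / "sample.pkl"
    csv_path.write_text("Name,Sales\n Alpha ,1.5\nAlpha,1.5\nBeta,2.0\n")

    cleaned = preprocess(load(csv_path), text_columns=["Name"])
    save(cleaned, pkl_path)
    restored = pd.read_pickle(pkl_path)

print(restored)
```

In a real project, only the three functions would live in `preprocessing.py`, and `main.py` or a Notebook would import them with `from utility.preprocessing import load, preprocess, save`.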

You can structure your project any way you like:
- **However, your final Dash application must run as a Python application (not within a Notebook), e.g. `python.exe main.py`.**
- Feel free to create a GitHub repository for the project (which will make it easier for the team/group to work on different files simultaneously, and enable versioning various files).
- You can try various code snippets, locally, in a Notebook before adding the code to the Python Dash application.
- You can add your own Python packages and Python modules (as in the example above) to better structure your Python application.
- etc.

Create a `README.md` file in your root `project` folder:
- Add a `README.md` file to your root `project` folder, in which you describe:
  - How to create a Python Virtual Environment for your project.
  - How to install the necessary Python packages for your project.
  - What each folder and file in your project represents, and what functionality is included in each project file.
- The `README.md` file should provide clear instructions to any user that wants to use your project.

---
## 4.2 Data Wrangling with Pandas

- I suggest you load the various `.csv` files into Pandas DataFrames, preprocess them, and save one or more resulting DataFrames as Pickle (`.pkl`) files.
  - Then you can just load the Pickle (`.pkl`) file into any `.py` and/or `.ipynb` file when you want to use the preprocessed data.
- Remember to *clean* the datasets.
- If a column contains strings (i.e. its `dtype` is `object`), you can access various string functions via `df['Col'].str.<string function>`, e.g.:
  - You can remove leading and trailing whitespace with `df['Col'].str.strip()`
  - You can remove any instance of the character `"` with `df['Col'].str.replace('"', '')`.
- You can drop duplicate rows from a Pandas DataFrame with `df = df.drop_duplicates()`.
  - You can drop duplicate rows from a Pandas DataFrame based on duplicates in a single column with `df = df.drop_duplicates(subset=['Col'])`.
- Many DataFrame methods support the `inplace` keyword argument.
  - So instead of assigning the result back to a variable as above, you can use `df.drop_duplicates(subset=['Col'], inplace=True)`.
- You can drop rows with any missing values with `df.dropna(inplace=True)`.
  - You can drop rows based on missing values in a single column with `df.dropna(subset=['Col'], inplace=True)`.
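- You can impute (replace) all missing values with e.g. the value `0` with `df.fillna(0, inplace=True)`.
  - You can impute (replace) all missing values in a specific column with e.g. the value `0` with `df['Col'] = df['Col'].fillna(0)` (calling `fillna(0, inplace=True)` on a selected column is unreliable, because it may act on a temporary copy instead of `df`).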
- You can use Pandas to perform the equivalent of a T-SQL SELECT statement as below.
  - Of course, you can use much simpler `df.groupby()` expressions for less complex queries than in the sample below.

  ```python
  # Pandas (the outer parentheses let the method chain span several lines)
  result = (
      df[df['Col1'].str.contains('something') & (df['Col2'] == 5)]
      .groupby(['Col1', 'Col2'], as_index=False)[['Col3', 'Col4']]
      .sum()
      .rename(columns={'Col3': 'Col3Total', 'Col4': 'Col4Total'})
      .query('Col3Total >= 10 and Col4Total >= 15')
      .sort_values(by=['Col1', 'Col3Total'], ascending=[True, False])
  )
  ```

  ```sql
  /* T-SQL */
  SELECT Col1, Col2, SUM(Col3) AS Col3Total, SUM(Col4) AS Col4Total
  FROM df
  WHERE Col1 LIKE '%something%' AND Col2 = 5
  GROUP BY Col1, Col2
  HAVING SUM(Col3) >= 10 AND SUM(Col4) >= 15
  ORDER BY Col1 ASC, Col3Total DESC;
  ```

- To get the equivalent of a T-SQL `SELECT TOP 5 Col1 FROM df ORDER BY Col1 ASC;`, you can use `df[['Col1']].sort_values(by='Col1', ascending=True).head(5)`.
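
To see the two query equivalents above run end to end, here is a small sketch on made-up data. The values are invented for illustration, the column names simply mirror the placeholders used above, and `head(2)` stands in for `TOP 5` because the sample only has four rows.

```python
import pandas as pd

# Made-up sample data; the column names mirror the placeholders used above.
df = pd.DataFrame({
    "Col1": ["something A", "something A", "something B", "other"],
    "Col2": [5, 5, 5, 5],
    "Col3": [6, 7, 4, 50],
    "Col4": [10, 9, 20, 50],
})

result = (
    df[df["Col1"].str.contains("something") & (df["Col2"] == 5)]  # WHERE
    .groupby(["Col1", "Col2"], as_index=False)[["Col3", "Col4"]]  # GROUP BY
    .sum()                                                        # SUM(...)
    .rename(columns={"Col3": "Col3Total", "Col4": "Col4Total"})   # AS
    .query("Col3Total >= 10 and Col4Total >= 15")                 # HAVING
    .sort_values(by=["Col1", "Col3Total"], ascending=[True, False])  # ORDER BY
)
print(result)

# Equivalent of SELECT TOP 2 Col1 FROM df ORDER BY Col1 ASC;
top2 = df[["Col1"]].sort_values(by="Col1", ascending=True).head(2)
print(top2)
```

Only the `something A` group passes the HAVING-style filter here, since the `something B` group sums to a `Col3Total` of 4.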

- To perform the equivalent of an SQL JOIN in Pandas, remember you can use a DataFrame's `merge()` method.
  - It supports an `'outer'`, `'inner'`, `'left'`, or `'right'` JOIN via the `how` keyword argument.
  - You can JOIN on the same column name in the left and right DataFrames using the `on` keyword argument.
  - You can JOIN on a different column name in the left and right DataFrames using the `left_on` and `right_on` keyword arguments.
  - You can JOIN on the index in the left and right DataFrames using the `left_index=True` and `right_index=True` keyword arguments.
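
The JOIN variants above can be sketched on two small made-up tables. The table contents and the column names `Game`, `Dev`, `Developer`, and `Country` are hypothetical and not taken from the project datasets. The `indicator=True` argument is an optional extra that adds a `_merge` column showing which side each row came from, which can help when checking a JOIN.

```python
import pandas as pd

# Made-up tables; "Dev" and "Developer" are hypothetical column names.
games = pd.DataFrame({"Game": ["G1", "G2", "G3"], "Dev": ["A", "B", "C"]})
devs = pd.DataFrame({"Developer": ["A", "B", "D"], "Country": ["SE", "JP", "US"]})

# LEFT JOIN on differently named key columns.
left = games.merge(devs, how="left", left_on="Dev", right_on="Developer")

# INNER JOIN on the index of both DataFrames.
inner = games.set_index("Dev").merge(
    devs.set_index("Developer"), how="inner", left_index=True, right_index=True
)

# OUTER JOIN; the _merge column marks rows as both, left_only, or right_only.
outer = games.merge(
    devs, how="outer", left_on="Dev", right_on="Developer", indicator=True
)

print(left, inner, outer, sep="\n\n")
```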

- To change the data type for a specific column, e.g. to `np.float64`, you can use `df['Col'] = df['Col'].astype(np.float64)`.
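
Several of the cleaning steps above can be combined as in the sketch below. The data and the column names `Name`, `Sales`, and `Score` are invented for illustration. The sketch assigns results back to `df` rather than using `inplace=True`; this is a choice made here, not a requirement.

```python
import numpy as np
import pandas as pd

# Made-up data with typical problems: stray whitespace, quote characters,
# a duplicate row, numbers stored as strings, and missing values.
df = pd.DataFrame({
    "Name": [' "Alpha" ', "Beta", "Beta", None],
    "Sales": ["1.5", "2", "2", "3"],
    "Score": [8.0, np.nan, np.nan, 7.0],
})

# Column-wise fixes first: clean strings, impute, and convert the data type.
df["Name"] = df["Name"].str.strip().str.replace('"', "")
df["Score"] = df["Score"].fillna(0)
df["Sales"] = df["Sales"].astype(np.float64)

# Row-wise fixes last: drop duplicate rows, then rows without a name.
df = df.drop_duplicates()
df = df.dropna(subset=["Name"]).reset_index(drop=True)

print(df)
print(df.dtypes)
```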


**Recommended Reading (not mandatory)**

I can recommend the article [Tidy Data](https://www.jstatsoft.org/article/view/v059i10), which covers how to format your data (DataFrame) into a usable format for visualization, and much more.

Wickham, H. (2014). Tidy Data. Journal of Statistical Software, 59(10), 1–23. https://doi.org/10.18637/jss.v059.i10

---
## 4.3 Numpy and Pandas Sample Code

Feel free to copy and tweak any Numpy and Pandas code from the Workshops and the Course Literature in your project.

Other good Numpy and Pandas sources are:
- The [Numpy Documentation](https://numpy.org/doc/stable) and the [Numpy Guides](https://numpy.org/learn), which contain numerous Numpy examples.
- The [Pandas User Guide](https://pandas.pydata.org/docs/user_guide/index.html) which contains numerous Pandas examples.

---
## 4.4 Plotly and Dash Sample Code

Feel free to copy and tweak any Plotly and Dash code from the Workshops and the Course Literature in your project.

Other good Plotly and Dash sources are:
- The [Plotly Open Source Graphing Library for Python](https://plotly.com/python) which contains numerous Plotly examples.
- The [Dash Python User Guide](https://dash.plotly.com) which contains numerous Dash examples, such as for:
  - The [Dash Core Components](https://dash.plotly.com/dash-core-components)
  - The [Dash HTML Components](https://dash.plotly.com/dash-html-components)
  - The [Dash Data Table](https://dash.plotly.com/datatable)

The course book *Interactive Dashboards and Data Apps with Plotly and Dash* is yet another good source:
- The course book's [GitHub Repository](https://github.com/PacktPublishing/Interactive-Dashboards-and-Data-Apps-with-Plotly-and-Dash) contains code examples for each chapter.
- The application that is built throughout the course book is deployed as a [Dash App running in the Cloud](https://povertydata.org).

---
# 5. Submission
---

## 5.1 Checklist

When you are done with your project, use this checklist before submitting the project via Canvas.

- Is your Dash App written as a Python application (i.e. using `.py` files)?
  - In the workshops, the various Dash Apps were all contained in a Jupyter Notebook, but your project's final Dash App needs to be an actual Python application.
  - For example, can your Dash App be started from the command line with e.g. `python.exe main.py`?
- Have you included a `README.md` file in your root `project` folder?
  - Does it give clear instructions how to create a Python Virtual Environment for your project?
  - Does it give clear instructions how to install all necessary Python packages into the Virtual Environment?
  - Does it give clear instructions how to run the Dash application?
  - Does it describe any additional files and/or folders you have added to your project, e.g. a Notebook that does all preprocessing?
- Use the instructions in your `README.md` file to:
  - Create a new Python Virtual Environment.
  - Install necessary Python packages.
  - Run your Dash application.
  - Run any necessary Notebooks (or `.py` files), e.g. a Notebook in which you have done all preprocessing.
  - Any user should be able to run your Dash app, and any other necessary file (`.py` or `.ipynb`), just from reading your `README.md` file.
    - Otherwise, you might have to fix this and resubmit your project.
- Have you completed all the CRISP-DM reports, i.e. all the `Outputs` in the Markdown files (`.md`) for the 6 phases?
- Have you saved all your project files?
- Have you re-run, from scratch, any Notebooks you have added to make sure they work as intended?
- Have you created an archive file containing all your project files?
  - Accepted archive files are `.zip`, `.rar`, `.7z`, `.tar`, `.gz` or `.bz2`.

---
## 5.2 Submit the Project

- Submit the archive file via [Canvas](https://hb.instructure.com/courses/10009/assignments/40723?module_item_id=227873) before the deadline (only submit one archive file).
  - Once again, accepted archive files are `.zip`, `.rar`, `.7z`, `.tar`, `.gz` or `.bz2`.