In [1]:
import os
import pandas as pd
import numpy as np
import json


#pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

## Overview

The Houston Rockets collect a wide range of data including ticket transactions, retail sales, and fan surveys. However, this data comes from various sources with differing formats, making it difficult to truly understand who our fans are and track how they’re interacting with the Houston Rockets.

Use the available data sources to create a unified database table (i.e., a single table that our Business Intelligence & Innovation team could leverage to build fan segments and determine their behaviors).

**Requirements**

Using your programming method of choice, create a unified database table that could be used as the basis for dashboards and reporting on fan segments and their behaviors.

At minimum, the database table should include the following:

- A unique identifier for each fan as the primary key
- Fan identifiers from each data source
- Fields containing contact information for each fan (email, phone number, and zip code)

The following calculated fields:
- Number of ticket transactions
- Number of retail transactions
- Number of survey responses

At least four additional calculated fields. For example:
- Average ticket price for each fan
- Fan total spend


- Project is available on GitHub, or similar SVN service, with a README on how to locally view and/or run your project. If a private repo, which we would encourage, please add @mkamla as a collaborator when your project is ready for review.
- Timely completion of the project, preferably no more than 7 days from delivery of project details.

**Evaluation Criteria**

Aside from adherence to the requirements, below are specific aspects that will be evaluated:

- Inclusion of supporting files, documentation and scripts used to generate the unified table
- Thoughtful consideration to datapoints that are relevant to a sports, entertainment and/or event business

**Bonus Points**

Additional consideration will be given to projects that include details about your methodology or approach, insights uncovered, supplemental tables and creativity in incorporating external resources that are additive to the project requirements and may reside outside the scope of this document.

The difference between ordinary and extraordinary is a little extra.



## Acquisition

In [5]:
# Import JSON file
json_data = pd.read_json('retail.json', orient='columns')

# Import CSV files
survey_data = pd.read_csv('surveys.csv')
ticket_data = pd.read_csv('tickets.csv')


# Print the first few rows of each data frame

print("JSON - retail data:")
print(json_data.head())

print("CSV - survey data:")
print(survey_data.head())

print("CSV - ticket data :")
print(ticket_data.head())

JSON - retail data:
                                              retail
0  {'transaction_id': 1, 'email': 'user18@rockets...
1  {'transaction_id': 2, 'email': 'user142@rocket...
2  {'transaction_id': 3, 'email': 'user182@rocket...
3  {'transaction_id': 4, 'email': 'user492@rocket...
4  {'transaction_id': 5, 'email': 'user101@rocket...
CSV - survey data:
   Submission ID                                          Attribute            Value
0              1                                           phone_no     290-551-1299
1              1                                           event_id             3220
2              1             how_satisfied_were_you_with_this_event                2
3              1  how_satisfied_were_you_with_your_retail_experi...                3
4              1  how_likely_are_you_to_attend_this_event_in_the...  5 - Very Likely
CSV - ticket data :
   transaction_id  account_no                email    zip      phone_no  section  row  qty  total_price  event_id

In [4]:
print("Info - retail data :")
json_data.info()

print("Info - survey data :")
survey_data.info()

print("Info - ticket data :")
ticket_data.info()

Info - retail data :
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   retail  2000 non-null   object
dtypes: object(1)
memory usage: 15.8+ KB
Info - survey data :
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12000 entries, 0 to 11999
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Submission ID  12000 non-null  int64 
 1   Attribute      12000 non-null  object
 2   Value          12000 non-null  object
dtypes: int64(1), object(2)
memory usage: 281.4+ KB
Info - ticket data :
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   transaction_id  10000 non-null  int64 
 1   account_no      10000 non-null  object
 2   email           10

**Takeaways**

- Retail JSON 

    - No nulls.
    - Currently 2000 rows and 1 column.

    **Things to Do**

    - Convert to a DataFrame with the following fields:
        - transaction_id, email, account_no, product_type, quantity, unit_price, shipping cost
        - The DataFrame will have 2000 rows and 7 columns.

- Ticket data

    - No nulls
    - Dataframe is 10,000 rows and 11 columns

- Survey data

    - No nulls
    - Need to pivot the data with 'Submission ID' as the index and the 'Attribute' values as columns. The final shape will be determined after transforming the data table.
    - Normalize some data fields.
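
The two reshaping steps noted above (expanding the single `retail` column of dicts, and pivoting the long survey table to wide) could be sketched as follows. This is a minimal illustration on made-up rows, not the actual files: the sample values, the `shipping_cost` key spelling, and the names `retail_df` and `survey_wide` are assumptions for demonstration only.

```python
import pandas as pd

# Hypothetical stand-in for json_data: one 'retail' column holding dict records.
# Key names follow the takeaways above; 'shipping_cost' is an assumed spelling.
json_data = pd.DataFrame({
    "retail": [
        {"transaction_id": 1, "email": "a@example.com", "account_no": "A1",
         "product_type": "jersey", "quantity": 2, "unit_price": 50.0,
         "shipping_cost": 5.0},
        {"transaction_id": 2, "email": "b@example.com", "account_no": "B2",
         "product_type": "hat", "quantity": 1, "unit_price": 20.0,
         "shipping_cost": 0.0},
    ]
})

# Expand each dict into its own row of named columns.
retail_df = pd.json_normalize(json_data["retail"].tolist())

# Hypothetical stand-in for survey_data in its long (attribute/value) layout.
survey_data = pd.DataFrame({
    "Submission ID": [1, 1, 1, 2, 2, 2],
    "Attribute": ["phone_no", "event_id",
                  "how_satisfied_were_you_with_this_event"] * 2,
    "Value": ["290-551-1299", "3220", "2", "555-000-1111", "3221", "4"],
})

# One row per submission, one column per attribute.
# pivot() raises if a submission repeats an attribute; pivot_table with an
# explicit aggfunc would be needed in that case.
survey_wide = (
    survey_data
    .pivot(index="Submission ID", columns="Attribute", values="Value")
    .reset_index()
)
survey_wide.columns.name = None  # drop the leftover 'Attribute' axis label

print(retail_df.shape)
print(survey_wide.shape)
```

Because every entry in `Value` is a string, columns such as `event_id` and the satisfaction scores would still need type conversion after the pivot.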

**MVP**

- Master Fan DataFrame

    - Unique ID for each fan as the primary key
    - Fan identifier from each data table
    - Contact information for each fan
        - email
        - phone number
        - zip code
    - Calculated fields
        - Number of ticket transactions
        - Number of retail transactions
        - Number of surveys completed
    - Additional fields (need 4)
        - Average ticket price per fan
        - Fan overall total spend (ticket + retail)
        - Most frequent seating section (mode) per fan (maybe)
        - Last additional field to be determined
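
One way the MVP fields listed above might be assembled is sketched below on toy data. It rests on assumptions not confirmed by the source: email is used as the join key between tickets and retail, surveys are linked through `phone_no` (the survey output shows a phone number but no email), and all sample rows plus the names `fans`, `fan_id`, and `shipping_cost` are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the cleaned tables (hypothetical values).
tickets = pd.DataFrame({
    "transaction_id": [1, 2, 3],
    "account_no": ["A1", "A1", "B2"],
    "email": ["a@example.com", "a@example.com", "b@example.com"],
    "zip": ["77002", "77002", "77005"],
    "phone_no": ["111-111-1111", "111-111-1111", "222-222-2222"],
    "section": [101, 101, 205],
    "qty": [2, 1, 4],
    "total_price": [200.0, 80.0, 400.0],
})
retail = pd.DataFrame({
    "transaction_id": [10, 11],
    "email": ["a@example.com", "c@example.com"],
    "quantity": [2, 1],
    "unit_price": [50.0, 20.0],
    "shipping_cost": [5.0, 0.0],
})
surveys = pd.DataFrame({
    "Submission ID": [1, 2],
    "phone_no": ["111-111-1111", "111-111-1111"],
})

# Per-fan ticket aggregates, keyed on email (assumed fan key).
ticket_agg = tickets.groupby("email").agg(
    account_no=("account_no", "first"),
    phone_no=("phone_no", "first"),
    zip=("zip", "first"),
    ticket_txn_count=("transaction_id", "count"),
    ticket_spend=("total_price", "sum"),
    tickets_bought=("qty", "sum"),
    favorite_section=("section", lambda s: s.mode().iloc[0]),
).reset_index()
ticket_agg["avg_ticket_price"] = (
    ticket_agg["ticket_spend"] / ticket_agg["tickets_bought"]
)

# Per-fan retail aggregates; line total = quantity * unit price + shipping.
retail = retail.assign(
    line_total=retail["quantity"] * retail["unit_price"] + retail["shipping_cost"]
)
retail_agg = retail.groupby("email").agg(
    retail_txn_count=("transaction_id", "count"),
    retail_spend=("line_total", "sum"),
).reset_index()

# Survey counts, keyed on phone number (assumed link to the ticket table).
survey_agg = (
    surveys.groupby("phone_no")
    .agg(survey_count=("Submission ID", "count"))
    .reset_index()
)

# Outer join keeps retail-only fans; left join attaches survey counts.
fans = (
    ticket_agg
    .merge(retail_agg, on="email", how="outer")
    .merge(survey_agg, on="phone_no", how="left")
)

# Fans missing from a source get zero counts and zero spend.
count_cols = ["ticket_txn_count", "retail_txn_count", "survey_count"]
fans[count_cols] = fans[count_cols].fillna(0).astype(int)
spend_cols = ["ticket_spend", "retail_spend"]
fans[spend_cols] = fans[spend_cols].fillna(0.0)
fans["total_spend"] = fans["ticket_spend"] + fans["retail_spend"]

# Surrogate primary key.
fans.insert(0, "fan_id", range(1, len(fans) + 1))
print(fans)
```

Keying on email is a simplification: a fan who uses different emails across sources would appear as separate rows, so a real build would likely need a matching step across email, phone number, and account number before aggregating.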



## Prepare the data