# Applying the Data Analysis Method to a Research Problem

# 1. Determine Research Objectives and Assess the Situation  <a class="anchor" id="Businessunderstanding"></a>
The first stage of the process is to understand what you want to accomplish from a research perspective. You may have competing objectives and constraints that must be properly balanced. The goal of this stage of the process is to uncover important factors that could influence the outcome of the project. Neglecting this step can mean that a great deal of effort is put into producing the right answers to the wrong questions.

## 1.1. Title <a class="anchor" id="Title"></a>
- It’s important - put a little thought into it!
- Descriptive, interesting, novel
- Appropriate
- There may be a slight change to the proposed title – but nothing major!

## 1.2. Introduction <a class="anchor" id="Introduction"></a> 
- State what the work is about.
- Describe your starting point and the background to the subject: e.g., what research has already been done (if you have to include a Literature Review, this will only be a brief survey).
- What are the relevant themes and issues?
- Why is it an important problem? Why are you investigating it now?   
- Explain how you are going to go about responding to the brief. 
- If you are going to test a hypothesis in your research, include this at the end of your introduction. 
- Include a brief outline of your method of enquiry. 
- State the limits of your research and reasons for them.
 

## 1.3. Terminology and Key Words <a class="anchor" id="Terminology"></a>
- A glossary of relevant business terminology
- A glossary of data mining terminology, illustrated with examples relevant to the business problem in question. 


## 1.4. Background <a class="anchor" id="Background"></a>
- Surveys publications (books, journals and sometimes conference papers) on work that has already been done on the topic of your report. 
- It should only include studies that have direct relevance to your research.  
- Introduce your review by explaining how you went about finding your materials, and any clear trends in research that have emerged.  
- Group your texts in themes. Write about each theme as a separate section, giving a critical summary of each piece of work, and showing its relevance to your research. 
- Conclude with how the review has informed your research (things you’ll be building on, gaps you’ll be filling etc.). 


## 1.5. Research Question <a class="anchor" id="Research Question"></a>

- What Questions Are We Trying To Answer?
- In general, a research question should be clear, focused, and complex.
- When writing it, consider how you will limit the study, i.e. what are you not going to ask? This should then help in developing any sub-questions.

## 1.6. Methodology/Methods <a class="anchor" id="Methodology/Methods"></a>

- State clearly how you carried out your investigation. 
- Explain why you chose this particular method. Is it based on the research in your background section? 
- Include the techniques and any equipment you used, for example any Python libraries or external tools such as Excel or screen scraping.
- If there were participants in your research, who were they? How many? How were they selected?  
- Write this section concisely but thoroughly.  
- You know what you did, but could a reader follow your description?   

# 2. Stage Two - Data Understanding <a class="anchor" id="Dataunderstanding"></a>
The second stage of the process requires you to acquire the data listed in the project resources. This initial collection includes data loading, if this is necessary for data understanding. For example, if you use a specific tool for data understanding, it makes perfect sense to load your data into this tool. If you acquire multiple data sources then you need to consider how and when you're going to integrate these.

## 2.1 Initial Data Report <a class="anchor" id="Datareport"></a>
**Initial data collection report** - List the data sources acquired, together with their locations, the methods used to acquire them, and any problems encountered along with the resolutions achieved. This will help both with future replication of this project and with the execution of similar future projects.

In [2]:
# Import Libraries Required
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import seaborn as sns

In [5]:
#Data source: 
#Source Query location: 
pathInfo = 'metacritic_game_info.csv'
pathComments = 'metacritic_game_user_comments.csv'
# Read both CSV files; by default pandas takes the column headers from the first row of each file
dfInfo =  pd.read_csv(pathInfo) 
dfComments = pd.read_csv(pathComments)

display(dfComments)

Unnamed: 0.1,Unnamed: 0,Title,Platform,Userscore,Comment,Username
0,0,The Legend of Zelda: Ocarina of Time,Nintendo64,10,"Everything in OoT is so near at perfection, it...",SirCaestus
1,1,The Legend of Zelda: Ocarina of Time,Nintendo64,10,I won't bore you with what everyone is already...,Kaistlin
2,2,The Legend of Zelda: Ocarina of Time,Nintendo64,10,Anyone who gives the masterpiece below a 7 or ...,Jacody
3,3,The Legend of Zelda: Ocarina of Time,Nintendo64,10,I'm one of those people who think that this is...,doodlerman
4,4,The Legend of Zelda: Ocarina of Time,Nintendo64,10,This game is the highest rated game on Metacr...,StevenA
5,5,The Legend of Zelda: Ocarina of Time,Nintendo64,10,I think it's funny that you have Zelda haters ...,joei1382
6,6,The Legend of Zelda: Ocarina of Time,Nintendo64,9,I played A Link To The Past so many times in m...,Corvix
7,7,The Legend of Zelda: Ocarina of Time,Nintendo64,10,The Legend of Zelda: Ocarina of Time is withou...,pittsburghboy91
8,8,The Legend of Zelda: Ocarina of Time,Nintendo64,10,"This review contains spoilers, cli...",Nosidda89
9,9,The Legend of Zelda: Ocarina of Time,Nintendo64,10,I'm not kidding when I say that this is the on...,Regeneration13


## 2.2 Describe Data <a class="anchor" id="Describedata"></a>
Data description report - Describe the data that has been acquired including its format, its quantity (for example, the number of records and fields in each table), the identities of the fields and any other surface features which have been discovered. Evaluate whether the data acquired satisfies your requirements.

In [None]:
df.columns

In [None]:
df.shape

In [None]:
df.dtypes

In [None]:
df.describe()

In [None]:
df.info()

In [None]:
df.head(5)

## 2.3 Verify Data Quality <a class="anchor" id="Verifydataquality"></a>

Examine the quality of the data, addressing questions such as:

- Is the data complete (does it cover all the cases required)?
- Is it correct, or does it contain errors and, if there are errors, how common are they?
- Are there missing values in the data? If so, how are they represented, where do they occur, and how common are they?

### 2.3.1. Missing Data <a class="anchor" id="MissingData"></a>
In addition to incorrect datatypes, another common problem when dealing with real-world data is missing values. These can arise for many reasons and have to be either filled in or removed before we train a machine learning model. First, let’s get a sense of how many missing values are in each column.

While we always want to be careful about removing information, if a column has a high percentage of missing values, then it probably will not be useful to our model. The threshold for removing columns should depend on the problem.

In [None]:
df.isnull().sum()

In [None]:
def missing_values_table(df):
    # Count and percentage of missing values per column
    mis_val = df.isnull().sum()
    mis_val_percent = 100 * mis_val / len(df)
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    mis_val_table_ren_columns = mis_val_table.rename(
        columns={0: 'Missing Values', 1: '% of Total Values'})
    # Keep only the columns that have missing values, sorted by percentage missing
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:, 1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
    print("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
          "There are " + str(mis_val_table_ren_columns.shape[0]) +
          " columns that have missing values.")
    return mis_val_table_ren_columns

In [None]:
missing_values_table(df)

In [None]:
# Get the columns with > 50% missing
missing_df = missing_values_table(df)
missing_columns = list(missing_df[missing_df['% of Total Values'] > 50].index)
print('We will remove %d columns.' % len(missing_columns))

In [None]:
# Drop the columns (columns= is required; drop() removes rows by label by default)
df = df.drop(columns=missing_columns)

### 2.3.2. Outliers <a class="anchor" id="Outliers"></a>
At this point, we may also want to remove outliers. These can be due to typos in data entry, mistakes in units, or they could be legitimate but extreme values. For this project, we will remove anomalies based on the definition of extreme outliers:

https://www.itl.nist.gov/div898/handbook/prc/section1/prc16.htm

- Below the first quartile − 3 ∗ interquartile range
- Above the third quartile + 3 ∗ interquartile range
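
The extreme-outlier rule above can be sketched as follows. This is a minimal sketch for a single numeric column; the helper name `remove_extreme_outliers`, the parameter `k` and the sample column `price` are hypothetical and only serve as illustration.

```python
import pandas as pd

def remove_extreme_outliers(df, col, k=3.0):
    """Keep rows whose value in `col` lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # between() is inclusive on both ends; rows with NaN in `col` are dropped
    return df[df[col].between(lower, upper)]

# Hypothetical sample data in which 1000 is an extreme value
sample = pd.DataFrame({'price': [10, 12, 11, 13, 12, 11, 1000]})
cleaned = remove_extreme_outliers(sample, 'price')
print(cleaned)
```

Whether to remove, cap or keep such values depends on the problem, so check a few of the flagged rows before dropping them.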

## 2.4 Initial Data Exploration  <a class="anchor" id="Exploredata"></a>
During this stage you'll address data mining questions using querying, data visualization and reporting techniques. These ***may*** include:

- **Distribution** of key attributes (for example, the target attribute of a prediction task)
- **Relationships** between pairs or small numbers of attributes
- Results of **simple aggregations**
- **Properties** of significant sub-populations
- **Simple** statistical analyses

These analyses may directly address your data mining goals. They may also contribute to or refine the data description and quality reports, and feed into the transformation and other data preparation steps needed for further analysis. 

- **Data exploration report** - Describe the results of your data exploration, including first findings or initial hypotheses and their impact on the remainder of the project. If appropriate, include graphs and plots here to indicate data characteristics that suggest further examination of interesting data subsets.

### 2.4.1 Distributions  <a class="anchor" id="Distributions"></a>

In [None]:
def count_values_table(series):
    # Absolute and relative frequency of each distinct value in a Series
    count_val = series.value_counts()
    count_val_percent = (100 * count_val / len(series)).round(1)
    # keys= names the two columns regardless of the name of the input Series
    return pd.concat([count_val, count_val_percent], axis=1,
                     keys=['Count Values', '% of Total Values'])

In [None]:
# Histogram
def hist_chart(df, col):
        plt.style.use('fivethirtyeight')
        plt.hist(df[col].dropna(), edgecolor = 'k');
        plt.xlabel(col); plt.ylabel('Number of Entries'); 
        plt.title('Distribution of '+col);

In [None]:
col = 'account_risk_band'
# Histogram & Results
hist_chart(df, col)
count_values_table(df.account_risk_band)

### 2.4.2 Correlations  <a class="anchor" id="Correlations"></a>
Can we derive any correlations from this data set? A pairplot shows pairwise relationships, the distribution of each variable and, with `kind='reg'`, a fitted regression line.
A correlogram is useful for exploratory analysis because it lets you quickly observe the relationship between every pair of variables in your matrix.
It is easy to produce with seaborn: just call the `pairplot` function.

Pairplot documentation can be found here: https://seaborn.pydata.org/generated/seaborn.pairplot.html

In [None]:
# Seaborn makes it easy to draw a correlogram (pairwise scatter-plot matrix).
#sns.pairplot(df.dropna().drop(['x'], axis=1), hue='y', kind ='reg')

#plt.show()


In [None]:
#df_agg = df.drop(['x'], axis=1).groupby(['y']).sum()
df_agg = df.groupby(['y']).sum()
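
Alongside the pairplot, a numeric correlation matrix can make the strength of each relationship explicit. The following is a minimal sketch on made-up data; the frame `sample` and its column names are hypothetical, and Pearson correlation (the pandas default) is assumed to be appropriate.

```python
import pandas as pd

# Hypothetical sample data with two numeric columns and one categorical column
sample = pd.DataFrame({
    'height': [1.0, 2.0, 3.0, 4.0],
    'weight': [2.0, 4.0, 6.0, 8.0],
    'label': ['a', 'b', 'a', 'b'],
})

# Restrict to numeric columns before computing pairwise Pearson correlations
corr_matrix = sample.select_dtypes(include='number').corr()
print(corr_matrix.round(2))
```

The resulting matrix can be passed to `sns.heatmap` if a visual summary is preferred.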

## 2.5 Data Quality Report <a class="anchor" id="Dataqualityreport"></a>
List the results of the data quality verification. If quality problems exist, suggest possible solutions. Solutions to data quality problems generally depend heavily on both data and business knowledge.

# 3. Stage Three - Data Preparation <a class="anchor" id="Datapreperation"></a>

## 3.1 Select Your Data <a class="anchor" id="Selectyourdata"></a>
This is the stage of the project where you decide on the data that you're going to use for analysis. The criteria you might use to make this decision include the relevance of the data to your data mining goals, the quality of the data, and also technical constraints such as limits on data volume or data types. Note that data selection covers selection of attributes (columns) as well as selection of records (rows) in a table.

Rationale for inclusion/exclusion - List the data to be included/excluded and the reasons for these decisions.

In [None]:
X_train_regr = df.drop(['date_maint', 'account_open_date'], axis = 1)
X_train = df.drop(['target', 'date_maint', 'account_open_date'], axis = 1)
X_test = test.drop(['date_maint', 'account_open_date'], axis = 1)

## 3.2 Clean The Data <a class="anchor" id="Cleansethedata"></a>
This task involves raising the data quality to the level required by the analysis techniques that you've selected. This may involve selecting clean subsets of the data, the insertion of suitable defaults, or more ambitious techniques such as the estimation of missing data by modelling.

### 3.2.1 Label Encoding <a class="anchor" id="labelEncoding"></a>
Label Encoding to turn Categorical values to Integers

An approach to encoding categorical values is to use a technique called label encoding. Label encoding is simply converting each value in a column to a number. For example, the body_style column contains 5 different values. We could choose to encode it like this:

- convertible -> 0
- hardtop -> 1
- hatchback -> 2
- sedan -> 3
- wagon -> 4

Source: http://pbpython.com/categorical-encoding.html

In [None]:
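from sklearn.preprocessing import LabelEncoder

# CAT_COLS is assumed to be a list of the categorical column names, defined earlier
for col in CAT_COLS:
    encoder = LabelEncoder()
    # Fit on the training data only, then reuse the same mapping for the test data
    X_train[col] = encoder.fit_transform(X_train[col].astype(str))
    # transform() raises a ValueError for labels not seen in the training data
    X_test[col] = encoder.transform(X_test[col].astype(str))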

In [None]:
df["column"] = df["column"].astype('category')
df.dtypes

In [None]:
df["column"] = df["column"].cat.codes
df.head()

### 3.2.2 Drop Unnecessary Columns <a class="anchor" id="DropCols"></a>
Sometimes we may not need certain columns. We can drop them to keep only the relevant data.

In [None]:
del_col_list = ['col1', 'col2']

df = df.drop(del_col_list, axis=1)
df.head()

### 3.2.3 Altering Data Types <a class="anchor" id="AlteringDatatypes"></a>
Sometimes we may need to alter data types, including converting to or from the object datatype.

In [None]:
#df['date'] = pd.to_datetime(df['date'])

### 3.2.4 Dealing With Zeros <a class="anchor" id="DealingZeros"></a>
Replace all the zeros in the selected columns with NaN. **Note** You may not want to do this - add / remove as required

In [None]:
#cols = ['col1', 'col2']
#df[cols] = df[cols].replace(0, np.nan)

In [None]:
# dropping all the rows with na in the columns mentioned above in the list.

# df.dropna(subset=cols, inplace=True)


### 3.2.5 Dealing With Duplicates <a class="anchor" id="DealingDuplicates"></a>
Remove duplicate rows. **Note** You may not want to do this - add / remove as required

In [None]:
#df = df.drop_duplicates(keep='first')

## 3.3 Construct Required Data   <a class="anchor" id="Constructrequireddata"></a>
This task includes constructive data preparation operations such as the production of derived attributes or entire new records, or transformed values for existing attributes.

**Derived attributes** - These are new attributes that are constructed from one or more existing attributes in the same record, for example you might use the variables of length and width to calculate a new variable of area.

**Generated records** - Here you describe the creation of any completely new records. For example you might need to create records for customers who made no purchase during the past year. There was no reason to have such records in the raw data, but for modelling purposes it might make sense to explicitly represent the fact that particular customers made zero purchases.
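
Both kinds of constructed data can be sketched in a few lines of pandas. This is an illustrative sketch only; the frames `rooms` and `purchases` and all customer identifiers are made up.

```python
import pandas as pd

# Hypothetical data: one row per room
rooms = pd.DataFrame({'length': [2.0, 3.0], 'width': [4.0, 5.0]})
# Derived attribute: area is computed from two existing attributes of the same record
rooms['area'] = rooms['length'] * rooms['width']

# Hypothetical data: purchase counts exist only for customers who bought something
purchases = pd.Series({'c1': 3, 'c2': 1}, name='n_purchases')
all_customers = ['c1', 'c2', 'c3']
# Generated records: reindex adds the missing customers with an explicit zero
purchases_full = purchases.reindex(all_customers, fill_value=0)

print(rooms)
print(purchases_full)
```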


## 3.4 Integrate Data  <a class="anchor" id="Integratedata"></a>
These are methods whereby information is combined from multiple databases, tables or records to create new records or values.

**Merged data** - Merging tables refers to joining together two or more tables that have different information about the same objects. For example a retail chain might have one table with information about each store’s general characteristics (e.g., floor space, type of mall), another table with summarised sales data (e.g., profit, percent change in sales from previous year), and another with information about the demographics of the surrounding area. Each of these tables contains one record for each store. These tables can be merged together into a new table with one record for each store, combining fields from the source tables.

**Aggregations** - Aggregations refers to operations in which new values are computed by summarising information from multiple records and/or tables. For example, converting a table of customer purchases where there is one record for each purchase into a new table where there is one record for each customer, with fields such as number of purchases, average purchase amount, percent of orders charged to credit card, percent of items under promotion etc.
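
The two operations described above might look like this in pandas. It is a sketch under stated assumptions: the tables, column names and values are hypothetical, and a left join on a single shared key is assumed.

```python
import pandas as pd

# Hypothetical store tables, one record per store
stores = pd.DataFrame({'store_id': [1, 2], 'floor_space': [500, 800]})
sales = pd.DataFrame({'store_id': [1, 2], 'profit': [10.0, 25.0]})
# Merged data: join on the shared key; validate= raises if a key is duplicated
merged = stores.merge(sales, on='store_id', how='left', validate='one_to_one')

# Hypothetical purchase table, one record per purchase
purchases = pd.DataFrame({
    'customer': ['a', 'a', 'b'],
    'amount': [10.0, 30.0, 5.0],
    'paid_by_card': [True, False, True],
})
# Aggregation: collapse to one record per customer
per_customer = purchases.groupby('customer').agg(
    n_purchases=('amount', 'size'),
    avg_amount=('amount', 'mean'),
    pct_card=('paid_by_card', 'mean'),
)
# The mean of a boolean column is a fraction; convert it to a percentage
per_customer['pct_card'] *= 100

print(merged)
print(per_customer)
```

Checking the row count before and after a merge is a cheap way to catch accidental duplication of records.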


## 3.5 Primary Data Set  <a class="anchor" id="Primary Data Set"></a>
Construct the primary data set. This is the pre-processed data set that will be used for the data modelling experiments.

# 4. Modelling <a class="anchor" id="Modelling"></a>
As the first step in modelling, you'll select the actual modelling technique that you'll be using. Although you may have already selected a tool during the business understanding phase, at this stage you'll be selecting the specific modelling technique, e.g. association rules with Apriori, decision-tree building with C5.0, clustering with K-Means or neural network generation with backpropagation. If multiple techniques are applied, perform this task separately for each technique.



## 4.1. Modelling technique <a class="anchor" id="ModellingTechnique"></a>
Document the actual modelling technique that is to be used.

Import Models below:

## 4.2. Modelling assumptions <a class="anchor" id="ModellingAssumptions"></a>
Many modelling techniques make specific assumptions about the data, for example that all attributes have uniform distributions, no missing values allowed, class attribute must be symbolic etc. Record any assumptions made.

- 
- 


## 4.3. Build Model <a class="anchor" id="BuildModel"></a>
Run the modelling tool on the prepared dataset to create one or more models.

**Parameter settings** - With any modelling tool there are often a large number of parameters that can be adjusted. List the parameters and their chosen values, along with the rationale for the choice of parameter settings.

**Models** - These are the actual models produced by the modelling tool, not a report on the models.

**Model descriptions** - Describe the resulting models, report on the interpretation of the models and document any difficulties encountered with their meanings.
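
One way to keep parameter settings and the resulting model together is sketched below. The choice of scikit-learn's `DecisionTreeClassifier`, the synthetic data and every parameter value are assumptions made for illustration, not a recommendation for any particular project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic placeholder data; replace with the primary data set from section 3.5
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Parameter settings: record each value together with the reason for choosing it
params = {
    'max_depth': 3,         # shallow tree to limit overfitting (illustrative)
    'min_samples_leaf': 5,  # avoid leaves based on very few records (illustrative)
    'random_state': 0,      # fixed seed for reproducibility
}
model = DecisionTreeClassifier(**params).fit(X_train, y_train)
accuracy = model.score(X_test, y_test)

print(model.get_params())
print(f'Hold-out accuracy: {accuracy:.2f}')
```

Keeping the settings in a dictionary means the same object can be printed in the report and passed to the model, so the documented values cannot drift from the ones actually used.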

# 5. Results/Data/Findings <a class="anchor" id="Results"></a>
- This section has only one job, which is to present the findings of your research as simply as possible. 
- Use the format that will achieve this most effectively: e.g. text, graphs, tables or diagrams. 
- Don’t repeat the same information in two visual formats (e.g. a graph and a table).  
- Label your graphs and tables clearly. 
- Give each figure a title and describe in words what the figure demonstrates. 
- Save your interpretation of the results for the Discussion section. 
- In most data mining projects a single technique is applied more than once and data mining results are generated with several different parameters.
- At this stage you should rank the models and assess them according to the evaluation criteria. 


# 6. Discussion <a class="anchor" id="Discussion"></a>

- This is probably the longest writing section. 
- It brings everything together, showing how your findings respond to the brief you explained in your introduction and the previous research you surveyed in your literature review. 
- This is the place to mention if there were any problems (e.g. your results were different from expectations, you couldn’t find important data, or you had to change your method or participants) and how they were, or could have been, solved.
- Interpret the models according to your domain knowledge, your data mining success criteria and your desired test design. 
- Judge the success of the application of modelling and discovery techniques technically, then in the business context. 




# 7. Conclusion <a class="anchor" id="Conclusion"></a>
- Should be a short section with no new arguments or evidence. 
- Sum up the main points of your research. How do they answer the original brief for the work reported on? This section may also include: 
    - Recommendations for action 
    - Suggestions for further research 

# 8. Reference List/Bibliography <a class="anchor" id="Reference"></a>

- List full details for any works you have referred to in the report. 
- For the correct style of referencing to use, check college guidelines.  
- If you are uncertain about how or when to reference, see the college library referencing guide.
