<div style="background:#EEEEdd; color:#552255; padding:8px; border-radius: 4px; text-align: center; font-weight: 500;">IFQ619 - Data Analytics for Strategic Decision Makers (2024)</div>

# IFQ619 :: A2-DataAnalyticsCycle

### QDAVI

In our approach to data analytics, we follow a process that requires us to address five questions:

1. Which is the right question?
2. Which is the right data?
3. Which is the right analysis?
4. Which is the right visualisation?
5. Which is the right insight?

For this unit, we are concerned with more than just data analytics: we are interested in what is *appropriate, efficacious, ethical ...* what is ***right!***

You can easily remember the data analytics cycle by the acronym **QDAVI**:

1. **Q**uestion
2. **D**ata
3. **A**nalysis
4. **V**isualisation
5. **I**nsight

<img src="graphics/QDAVI_cycle_sm.png" width="50%" />


---

## QDAVI Example

### Question

In this part of the cycle we will:
1. consider the context, and the main concern of stakeholders 
2. identify a specific question (or questions) to address the context and stakeholder concern/s, and 
3. plan how we might connect the question with available data for analysis


#### Context: 

We have many students in IFQ619 coming from different backgrounds and choosing to study for different reasons. A better understanding of the cohort may be helpful for the teaching team, allowing them to adapt the learning experiences to groups with different needs.

> **QUESTION** What different groups of students are studying IFQ619, and how might information about the cohort help the teaching team provide meaningful learning experiences?

#### Plan: 

To answer this question, we can consider which groupings the available data might support. Some possibilities are:
- courses - may provide information on whether students have some IT knowledge
- international or domestic - may indicate whether they have local support for their study
- on campus or virtual lecture - may indicate availability and/or preference for face-to-face learning.

### Data

In this part of the cycle, we will:
1. Identify appropriate data to address the question
2. Read the data into the analysis environment (Jupyter)
3. Undertake any cleaning or formatting needed for analysis

#### Required libraries

For any data analysis, we need to use existing software that has been loaded into the Jupyter environment in the form of 'libraries', 'packages', or 'modules'. To make these libraries available to your notebook, you need to `import` them.

In [None]:
import ??? as ??? # Dataframes

#### Read in the data

1. Take a look at the data first to identify its structure
2. Use appropriate code to read the data into Jupyter
3. Display the data to check it was read correctly

In [None]:
# Read the CSV into a dataframe
file_name = ???
class_df = pd.read_csv(f"data/{file_name}",index_col=???)
class_df.shape
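
As a rough illustration of what `index_col` does, the sketch below reads a small made-up CSV from a string rather than from a file. The column names `ID` and `MODE`, the variable names, and all of the values are assumptions for demonstration only; the real class file may be structured differently.

```python
import io
import pandas as pd

# Hypothetical stand-in for the class CSV (made-up columns and values)
csv_text = """ID,COURSE,INT,MODE
1,Course A,Domestic,On campus
2,Course A,International,Virtual
3,Course B,Domestic,Virtual
"""

# index_col names the column to use as row labels instead of the default 0, 1, 2, ...
demo_df = pd.read_csv(io.StringIO(csv_text), index_col="ID")

# shape is (rows, columns); the index column is not counted as a data column
print(demo_df.shape)
```

Checking `shape` straight after reading is a quick way to confirm that the number of rows and columns matches what we saw when we first looked at the file.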

In [None]:
# We can take a look at the dataframe by adding the variable name as the last line of a cell
???

We can view more rows of the dataframe by setting the `display.max_rows` option...

In [None]:
pd.set_option('display.max_rows', None)

In [None]:
class_df
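
Note that `pd.set_option` changes the setting for every later cell in the notebook. If we only want a longer (or shorter) display for one cell, a possible alternative is `pd.option_context`, sketched below with a made-up dataframe (`demo_df` and its `VALUE` column are assumptions for illustration).

```python
import pandas as pd

# Made-up dataframe with 100 rows, for illustration only
demo_df = pd.DataFrame({"VALUE": range(100)})

# Temporarily limit the display to 10 rows; the previous setting returns after the block
with pd.option_context("display.max_rows", 10):
    inside = pd.get_option("display.max_rows")
    print(demo_df)

# Alternatively, undo an earlier pd.set_option call by restoring the default
pd.reset_option("display.max_rows")
after = pd.get_option("display.max_rows")
print(inside, after)
```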

### Analysis

In this part of the cycle, we will:
1. Explore the data to identify patterns that may be helpful
2. Select appropriate techniques to address the question
3. Analyse the data using selected techniques
4. Check suitability of results and re-analyse as necessary

As part of this process, we will take into account: 
- the composition of the data and how techniques might yield useful results
- starting with simple approaches and working up to more complex as necessary (and/or feasible)
- the appropriateness of the results - is the analysis yielding useful information that can help answer the question?

#### Explore categories

One obvious feature of the data is that each column contains categories.

In [None]:
# Explore categories by putting the column name in the groupby function
class_df.groupby(???).count()

In [None]:
course_totals_df = class_df.groupby(???).size().reset_index(name=???)
???

In [None]:
# Display descriptive statistics of data
course_totals_df.describe()
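
To see how the `groupby(...).size().reset_index(name=...)` pattern behaves, here is a minimal sketch on a made-up dataframe. The course labels, the `TOTAL` column name, and the variable names are assumptions chosen for illustration, not the values from the class data.

```python
import pandas as pd

# Made-up data: six students across three hypothetical courses
demo_df = pd.DataFrame({
    "COURSE": ["A", "A", "B", "C", "C", "C"],
    "INT": ["Domestic", "International", "Domestic",
            "Domestic", "Domestic", "International"],
})

# size() counts the rows in each group; reset_index(name=...) turns the
# result back into a dataframe and names the new column of group sizes
totals_df = demo_df.groupby("COURSE").size().reset_index(name="TOTAL")
print(totals_df)

# describe() summarises the numeric column only (here, TOTAL)
print(totals_df.describe())
```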

### Visualise

At this point, it could be helpful to visualise what we found. Visualisation not only makes our analysis visible to others, it can also help expose further lines of investigation.

In [None]:
# Simple visualisation
course_totals_df.plot(kind=???)

While this visualisation is helpful for a quick view, we will usually want more control over how the visualisation looks, so a dedicated library like `plotly.express` is helpful.

In [None]:
# import the visualisation library
import ??? as px

In [None]:
# Create a visualisation with plotly.express
fig = px.bar(course_totals_df,x=???,y=???,width=???,height=???)
fig.show()

When visualising data, we should consider making our visualisations easy for stakeholders to interpret. One simple way of improving this is to order our data in a logical way.

In [None]:
sorted_df = course_totals_df.sort_values(by=???,ascending=False)
sorted_df

In [None]:
fig = px.bar(???,x=???,y=???,width=600,height=600)
fig.show()

### Analysis (part 2)

It is common to return to analysis after gaining some insights from the initial visualisation. We often need to explore the data further.

#### Explore sub-categories
We can also get counts for subgroups by passing a list of column names to `groupby`...

In [None]:
# Explore the various combinations by changing the column names in the list
class_df.groupby(['COURSE',???]).count()

This gives us a `count` of the non-missing values in each remaining column for every group. To get the size of the group itself, we use the `size` function.

In [None]:
class_df.groupby([???,???]).size()
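
The difference between `count` and `size` is easiest to see when a value is missing. The sketch below uses a made-up dataframe in which one `MODE` entry is empty; the `MODE` column name and all values are assumptions for illustration.

```python
import pandas as pd

# Made-up data with one missing MODE value
demo_df = pd.DataFrame({
    "COURSE": ["A", "A", "B"],
    "INT": ["Domestic", "Domestic", "International"],
    "MODE": ["On campus", None, "Virtual"],
})

# count() reports non-missing values in each remaining column, per group
counts = demo_df.groupby(["COURSE", "INT"]).count()

# size() reports the number of rows in each group, missing values included
sizes = demo_df.groupby(["COURSE", "INT"]).size()

print(counts)
print(sizes)
```

For the group `("A", "Domestic")`, `count` reports one `MODE` value while `size` reports two rows, which is why `size` is the safer choice when we want group totals.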

We can create new dataframes by manipulating the existing dataframe.

In [None]:
# Create a new dataframe with a column for the number of students in the course_int combination
course_int_df = class_df.groupby(['COURSE','INT']).size().reset_index(name="COURSE_INT")
course_int_df

In [None]:
# Create a dataframe for course_mode combination
course_mode_df = class_df.groupby(['COURSE',???]).size().reset_index(name="COURSE_MODE")
course_mode_df

### Visualise (part 2)

This new finer-grained analysis lends itself to a different kind of visualisation.

In [None]:
# Visualise COURSE - INT data
fig2 = px.bar(???,x=???,y=???,color=???,width=600,height=600)
fig2.show()

Warnings can be annoying at times, but it is helpful to know when the libraries you are using may be changing and that you may need to modify your code in the future. For now, we can suppress the warnings.

In [None]:
# Suppress FutureWarning messages for the rest of the notebook
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
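
If we would rather silence warnings for a single piece of code instead of the whole notebook, one option is the `warnings.catch_warnings` context manager from the standard library. In the sketch below, `noisy` is a hypothetical function that stands in for a library call that emits a `FutureWarning`.

```python
import warnings

def noisy():
    # Hypothetical function that mimics a library emitting a FutureWarning
    warnings.warn("this behaviour will change", FutureWarning)
    return 42

# The filter only applies inside the with block; earlier settings return afterwards
with warnings.catch_warnings():
    warnings.simplefilter(action="ignore", category=FutureWarning)
    result = noisy()

print(result)
```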

In [None]:
# Visualise COURSE - MODE data
fig3 = px.bar(???,x='COURSE',y='COURSE_MODE',color=???,width=600,height=600)
fig3.show()

### Insights

To derive insights, we need to think about the analysis and visualisation and relate them to the original question:

> **QUESTION** What different groups of students are studying IFQ619, and how might information about the cohort help the teaching team provide meaningful learning experiences?

- What do our analysis and visualisation show us that is relevant to the initial question?
- Which of the results are the most insightful and why?
- Do we need to do further analysis? Why?
- What further analysis is possible? Should it be considered? Why?
- What are the limitations of the data analytics that we've done? Can we overcome those limitations?


Considering these questions carefully can sometimes result in...
- re-thinking the original question
- looking for additional data to supplement the original data
- undertaking more cleaning of the original data
- pursuing further analysis or visualisation
- (occasionally) abandoning the approach and adopting a new approach

---

## Next steps...

1. Experiment with the data and the code above to try different approaches to the analysis
2. Attend the tutorial, and try the data analytics cycle with a different problem
3. Try out some of the exercises during the week