# Introduction

As the term **“univariate”** suggests, this session deals with analysing variables one at a time. It is important to separately understand each variable before moving on to analysing multiple variables together.

### Data Description:

Given a data set, the first step is to understand what it contains. Information about a data set can be gained simply by looking at its metadata. Metadata, in simple terms, is the data that describes each variable in detail. Information such as the size of the data set, how and when the data set was created, what the rows and variables represent, etc. are captured in the metadata.  

![MedaData.jpg](attachment:MedaData.jpg)

#### Q: Ordered and Unordered Categorical Variables
Categorical variables can be of two types: ordered categorical and unordered categorical. In an unordered categorical variable, it is not possible to say that a certain category is 'more or less' or 'higher or lower' than others. For example, color is such a variable (red is not greater or more than green etc.).

On the other hand, ordered categories have a notion of 'higher-lower', 'before-after', 'more-less' etc. For example, the age-group variable with three values (child, adult and old) is ordered categorical because an old person is 'more aged' than an adult. In general, it is possible to define some kind of ordering.

Which of the following variables are ordered categorical?

**Ans**: The months in a year - Jan, Feb, March etc.

Months have an element of ordering: Jan comes before April, Dec comes after every other month, and so on. In general, dates can be treated as ordered categorical variables (day 23 comes after day 11 of the month etc.).


### Types of variables

1. **Categorical variables**:
    - Ordered categorical 
        - Salary band of an employee >> High-Medium-Low
        - Month >> Jan-Feb...Dec
        
    - Un-ordered categorical 
        - Type of loan >> Home, Personal, Car
        - Department of a person >> Sales, Marketing, HR etc.
    
    
2. **Quantitative/Numeric variables** (continuous or discrete):
    - Quantitative variables are simply numeric variables which can be added up, multiplied, divided etc. For example, salary, number of bank accounts, runs scored by a batsman, the mileage of a car etc.
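
As a rough illustration of the two categorical types listed above, the sketch below uses the pandas `Categorical` type. The values are made-up toy data, and the names `salary_band` and `loan_type` are assumptions chosen only for this example.

```python
import pandas as pd

# Toy data: salary bands form an ordered categorical variable (Low < Medium < High).
salary_band = pd.Series(
    pd.Categorical(
        ["Medium", "Low", "High", "Low"],
        categories=["Low", "Medium", "High"],
        ordered=True,
    )
)

# Toy data: loan types form an unordered categorical variable.
loan_type = pd.Series(["Home", "Car", "Personal", "Home"], dtype="category")

# An ordering is defined for salary_band, so min/max and comparisons work.
print(salary_band.min(), salary_band.max())
print((salary_band > "Low").tolist())

# For loan_type only equality and counting are meaningful.
print(loan_type.cat.ordered)
print(loan_type.value_counts().to_dict())
```

Declaring the order explicitly matters: without `ordered=True`, pandas treats the categories as unordered and refuses comparisons such as `salary_band > "Low"`.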
     

#### Q: Categorical Variables
A survey variable called “StateofMind” contains values 1 to 4 denoting the different states of mind of the respondents.

| StateofMind | Meaning |
| --- | --- |
| 1 | Confused |
| 2 | Happy |
| 3 | Depressed |
| 4 | Confident |
 

Which of the following categories would the variable “StateofMind” fall into?

**Ans**: Unordered categorical

Don't get confused by the numeric values for StateofMind. There is no order between the categories.
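
One way to avoid treating such codes as numbers is to map them to their labels before analysis. The sketch below assumes a handful of hypothetical survey responses; only the code-to-label mapping comes from the table above.

```python
import pandas as pd

# Hypothetical survey responses, coded 1 to 4 as in the table above.
responses = pd.Series([2, 4, 1, 2, 3, 2, 4])

# Mapping taken from the StateofMind table.
labels = {1: "Confused", 2: "Happy", 3: "Depressed", 4: "Confident"}

# Replace the codes with labels and store them as an unordered categorical.
state_of_mind = responses.map(labels).astype("category")

print(state_of_mind.cat.ordered)   # the categories carry no order
print(state_of_mind.mode()[0])     # the most frequent category is still meaningful
```

Once the codes are replaced by labels, an accidental `mean()` on the column is no longer possible, which guards against reading meaning into the numbers 1 to 4.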

# Unordered Categorical Variables - Univariate Analysis


- It is important to note that rank-frequency plots enable you to extract meaning even from seemingly trivial unordered categorical variables such as country, name of an artist, name of a GitHub user etc. 
- The objective here is not to put excessive focus on power laws or rank-frequency plots, but rather to understand that non-trivial analysis is possible even on unordered categorical variables, and that plots can help you out in that process.


**Why plotting on a log-log scale helps**

 

The objective of using a log scale is to make the plot readable by changing the scale. For example, the first ranked item had a frequency of 29000, the second ranked had 3500, the seventh had 700 and most others had very low frequencies such as 100, 80, 21 etc. The range of frequencies is too large to show clearly on a linear axis.

Plotting on a log scale compresses the values to a smaller scale, which makes the plot easy to read.

This happens because log(x) grows much more slowly than x. With base-10 logarithms, log(10) = 1, log(100) = 2, log(1000) = 3 and so on. Thus, log(29000) is approx. 4.5, log(3500) is approx. 3.5 and so on. What was earlier varying from 29000 to 1 is now compressed between 4.5 and 0, making the values easier to read on a plot.
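
The compression can be checked with a few lines of Python. This is only a small sketch using the frequencies quoted above and the standard library's base-10 logarithm.

```python
import math

# Frequencies quoted in the text above.
frequencies = [29000, 3500, 700, 100, 80, 21]

# Base-10 logarithm of each frequency, rounded for readability.
log_frequencies = [round(math.log10(f), 2) for f in frequencies]

for f, lf in zip(frequencies, log_frequencies):
    print(f, "->", lf)
```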

 

We will not get deeper into power law and what causes it, though you can access the additional material on 'what causes the power law' by Anand here.

 

To summarise, the major takeaways from this lecture are:

- Plots are immensely helpful in identifying hidden patterns in the data 
- It is possible to extract meaningful insights from unordered categorical variables using rank-frequency plots
- Rank-frequency plots of unordered categorical variables typically follow a power law distribution, which shows up as a roughly straight line when plotted on a log-log scale
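
A minimal sketch of how a rank-frequency table could be built with pandas is shown below. The `country` values are made-up toy data, chosen so that frequency is proportional to 1/rank, and the column names are assumptions for this example. Real data will only follow a straight line approximately.

```python
import numpy as np
import pandas as pd

# Toy data: a hypothetical unordered categorical variable.
country = pd.Series(["IN"] * 12 + ["US"] * 6 + ["UK"] * 4 + ["FR"] * 3)

# value_counts() sorts by frequency, most frequent first.
freq = country.value_counts()

# Rank 1 is the most frequent category.
rank_freq = pd.DataFrame({
    "category": freq.index,
    "rank": np.arange(1, len(freq) + 1),
    "frequency": freq.to_numpy(),
})

# Log-transform both axes, as on a log-log plot.
rank_freq["log_rank"] = np.log10(rank_freq["rank"])
rank_freq["log_frequency"] = np.log10(rank_freq["frequency"])

# A power law appears as a straight line on the log-log scale;
# its slope can be estimated with a degree-1 least-squares fit.
slope, intercept = np.polyfit(rank_freq["log_rank"], rank_freq["log_frequency"], 1)
print(rank_freq)
print("slope:", round(slope, 2))
```

To draw the plot itself, `rank` and `frequency` can be passed to a plotting function with logarithmic axes, for example `matplotlib.pyplot.loglog`.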

# Ordered Categorical Variables - Univariate Analysis


In [None]:
import pandas as pd

tendulkar = pd.read_csv("tendulkar_ODI.csv")
tendulkar.head()

In [19]:
# Check the column types using 'dtypes'
tendulkar.dtypes


Unnamed: 0     int64
Runs          object
Mins          object
BF            object
4s            object
6s            object
SR            object
Pos           object
Dismissal     object
Inns          object
Opposition    object
Ground        object
Start Date    object
dtype: object

In [9]:
# Check whether the '4s' column has any missing values
tendulkar["4s"].isna().sum()

0

In [35]:
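# Convert the object column to a numeric dtype.
# '4s' was read as 'object' despite having no missing values, which suggests
# it holds non-numeric strings, so a plain astype(int) would raise an error.
# pd.to_numeric with errors="coerce" turns such values into NaN instead.

tendulkar["4s"] = pd.to_numeric(tendulkar["4s"], errors="coerce")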


tendulkar["Runs"].value_counts()


1      16
2      14
0      12
4       9
21      8
       ..
139     1
70*     1
101     1
87*     1
85      1
Name: Runs, Length: 118, dtype: int64

In [None]:
tendulkar.dtypes

- A histogram or bar chart is a suitable plot for ordered categorical variables, with the categories kept in their natural order
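
The `value_counts()` output above shows entries such as `70*` and `87*` in the `Runs` column, which is why the column is read as `object`. The sketch below is one possible way to clean such a column before plotting it. It uses a small toy sample rather than the actual file, and it assumes the trailing `*` marks a not-out innings that can simply be stripped.

```python
import pandas as pd

# Toy sample mimicking the 'Runs' column (not the real data).
runs = pd.Series(["1", "0", "70*", "139", "1", "87*", "0", "1"])

# Strip the trailing "*" (assumed to be a not-out marker) and convert to numbers.
# Anything that is still non-numeric becomes NaN instead of raising an error.
runs_clean = pd.to_numeric(runs.str.rstrip("*"), errors="coerce")

# Frequencies in the natural (numeric) order rather than sorted by count.
runs_by_value = runs_clean.value_counts().sort_index()
print(runs_by_value)
```

Sorting by index instead of by count is what distinguishes the ordered case from the rank-frequency view used for unordered variables.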


----

# Recommended Additional Content

- Mean and median: The basics 
    - https://www.khanacademy.org/math/ap-statistics/summarizing-quantitative-data-ap/measuring-center-quantitative/v/mean-median-and-mode


- Mean and median: Advanced 
    - https://www.khanacademy.org/math/statistics-probability/displaying-describing-data/more-mean-median/e/calculating-the-mean-from-various-data-displays
    
    
- Range, Interquartile range (IQR), Mean absolute deviation (MAD)
    - https://www.khanacademy.org/math/statistics-probability/displaying-describing-data/range-iqr-mad/v/range-and-mid-range
    
    
- Population variance and standard deviation
    - https://www.khanacademy.org/math/statistics-probability/displaying-describing-data/pop-variance-standard-deviation/v/range-variance-and-standard-deviation-as-measures-of-dispersion
    
    
- Sample variance and standard deviation
    - https://www.khanacademy.org/math/statistics-probability/displaying-describing-data/sample-standard-deviation/v/sample-variance
    
    
- Basic probability
    - https://www.khanacademy.org/math/statistics-probability/probability-library
    

- Random variables
    - https://www.khanacademy.org/math/statistics-probability/random-variables-stats-library
    


- Unordered categorical
    - Pick a purely categorical column, draw its rank-frequency plot on a log-log scale and check whether it fits a power law distribution.
    - Here we order the categories by rank (most frequent first)
- Ordered categorical
    - We look at the frequency of each category
    - Here we order the categories by their natural order

# Quantitative Variables - Univariate Analysis


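A quantitative variable can also be analysed in a second way, by treating it as ordered categorical:
- Ordered categorical
    - We can create bins
    - Marks, for example, take values such as 1, 2, 3 rather than 0.5, 0.8, 0.1, so they group naturally into ordered bins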


#### Q1: Numeric and Ordered Categorical Variables
Anand mentioned that you can treat numeric variables as ordered categorical variables. For analysis, you can deliberately convert numeric variables into ordered categorical ones. For example, if you have the incomes of a few thousand people ranging from $5,000 to $100,000, you can categorise them into bins such as [5000, 10000], [10000, 15000], [15000, 20000] and so on.

This is called 'binning'. 

Which of the following variables can be binned into ordered categorical variables? Mark all the correct options.

**Ans**:
- The temperature in a city over a certain time period
    - You can bin the temperatures as [0, 10 degrees], [10, 20 degrees] etc.
- The revenue generated per day of a company
    - This can also be binned e.g. [0, 10k], [10k, 20k] etc.
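
Binning of this kind can be sketched with `pandas.cut`. The income values and the bin labels below are made up for illustration; only the bin edges follow the income example above.

```python
import pandas as pd

# Toy incomes (hypothetical values).
incomes = pd.Series([5000, 7500, 12000, 18000, 20000])

# Bin edges from the income example; the labels are names chosen for this sketch.
bin_edges = [5000, 10000, 15000, 20000]
bin_labels = ["5k-10k", "10k-15k", "15k-20k"]

# Intervals are right-closed by default; include_lowest=True keeps 5000 in the first bin.
income_band = pd.cut(incomes, bins=bin_edges, labels=bin_labels, include_lowest=True)

print(income_band.tolist())
print(income_band.cat.ordered)                   # pd.cut produces an ordered categorical
print(income_band.value_counts().sort_index())   # counts in bin order
```

Because the bins share their edges, a value such as 10000 needs a rule for which bin it belongs to. With the defaults used here it falls into the lower bin.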
 

#### Q2: Which metric to use
Consider the example Anand is discussing: there is a group of middle-class IT employees and Bill Gates in a room. If you want to get a rough sense of the income made by a typical IT employee, which metric would you choose?

**Ans**:
- Median
    - The average, calculated taking Bill Gates’ income into account, would be an overwhelmingly large number far from the income of any other IT employee in the room. On the other hand, the median is barely affected by a single extreme value such as Bill Gates’ income and would be more representative.
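
The effect can be seen with a few made-up numbers. The incomes below are purely hypothetical and only illustrate how one extreme value moves the mean but hardly moves the median.

```python
import statistics

# Hypothetical annual incomes of five IT employees.
employee_incomes = [40_000, 45_000, 50_000, 55_000, 60_000]

# The same room with one extremely high (hypothetical) income added.
with_outlier = employee_incomes + [1_000_000_000]

mean_without = statistics.mean(employee_incomes)
median_without = statistics.median(employee_incomes)
mean_with = statistics.mean(with_outlier)
median_with = statistics.median(with_outlier)

print("without outlier:", mean_without, median_without)
print("with outlier:   ", round(mean_with), median_with)
```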
    
    
#### Q3: Median
Let’s consider a sample which contains the ages of the students in this program: 36, 42, 32, NA, 22, NA, 25. The median age of this set is:

**Ans**: 32

After removing the NA values and ordering, there are 5 quantities left. Thus, the median is the third value, i.e. 32.

While the mean gives an average of all the values, the median gives a typical value that could be used to represent the entire group. As a simple rule of thumb, question any use of the mean on data that may contain outliers, since the median is usually a better measure of ‘representativeness’ in such cases.
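
The median of the ages sample above can be checked with pandas, which skips missing values by default. This is a small sketch in which the NA entries are represented as `None`.

```python
import pandas as pd

# Ages from the question; the two NA values are entered as None.
ages = pd.Series([36, 42, 32, None, 22, None, 25])

# Remaining values after dropping the missing ones, in sorted order.
print(ages.dropna().sort_values().tolist())

# median() ignores missing values by default (skipna=True).
median_age = ages.median()
print(median_age)
```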

#### Descriptive data summary

| Term | Description |
| :--- | :--- |
| First quartile (Q1) | Value at the 25th percentile of the data |
| Third quartile (Q3) | Value at the 75th percentile of the data |
| Median | Middle value of a set of data, i.e. the value at the 50th percentile |
| Mean | Average value of a set of data |
| Mode | Value that occurs most often in a set of data |
| Variance | Average of the squared deviations of the data points from the mean |
| Standard Deviation (SD) | Measure of the deviation of the data points from the mean, SD = √(Variance) |
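
The metrics in the table can be computed with pandas as sketched below. The values are a toy data set chosen for easy arithmetic. The variance and standard deviation use `ddof=0` (population formulas) to match the table's definition, whereas pandas defaults to the sample versions (`ddof=1`).

```python
import pandas as pd

# Toy data set (hypothetical values).
values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

summary = {
    "Q1": values.quantile(0.25),       # 25th percentile (linear interpolation)
    "median": values.median(),         # 50th percentile
    "Q3": values.quantile(0.75),       # 75th percentile
    "mean": values.mean(),
    "mode": values.mode()[0],          # most frequent value
    "variance": values.var(ddof=0),    # average squared deviation from the mean
    "sd": values.std(ddof=0),          # square root of the variance
}

for name, value in summary.items():
    print(name, value)
```

Different tools use slightly different interpolation rules for quartiles, so Q1 and Q3 may differ a little between libraries on small samples.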



#### Q: Which metric to use
Which of the following metrics can be used in case of unordered categorical data?

**Ans**: Mode

Mode is the value with the maximum frequency. In unordered categorical variables, any order or difference between values is not defined. So, using the median or the mean makes no sense here.

# Box Plot Explanation



![image.png](attachment:image.png)



![image.png](attachment:image.png)

![image.png](attachment:image.png)


- Standard deviation and interquartile range (IQR, the difference between the 75th and 25th percentiles) are both used to represent the spread of the data.


- The interquartile range is a much better metric than the standard deviation if there are outliers in the data.


- This is because the standard deviation is influenced by outliers, while the interquartile range depends only on the middle 50% of the data and is largely unaffected by them.
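
The difference in sensitivity can be sketched with a toy example. The values below are made up, and `spread` is a helper name introduced only for this illustration.

```python
import pandas as pd

def spread(values):
    """Return (standard deviation, interquartile range) of a sequence of numbers."""
    s = pd.Series(values)
    iqr = s.quantile(0.75) - s.quantile(0.25)
    return s.std(), iqr

clean = [10, 12, 14, 16, 18]       # toy data without outliers
with_outlier = clean + [1000]      # the same data plus one extreme value

sd_clean, iqr_clean = spread(clean)
sd_outlier, iqr_outlier = spread(with_outlier)

print("SD :", round(sd_clean, 2), "->", round(sd_outlier, 2))
print("IQR:", iqr_clean, "->", iqr_outlier)
```

In this toy case a single extreme value inflates the standard deviation by more than a factor of 100, while the interquartile range changes only slightly.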

#### Q: Quantiles
Look at the following marks for a course exam (out of 100).

| Quantiles | min | 10% | 25% | 50% | 75% | 90% | max |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Final Marks | 10 | 48 | 55 | 66 | 78 | 87 | 93 |
 

Which of the following statements is FALSE?

**Options**:
- About 1/4 of the class received a score of 55 or less
- About 3/4 of the class received a score of 78 or less
- About 50% of the class received grades between 55 and 78
- About 1/3 of the class received a score of 48 or less

**Ans**: About 1/3 of the class received a score of 48 or less

About 10% of the class received a score of 48 or less, not 1/3.
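
A quantile table like the one above can be produced with `Series.quantile`. The sketch below uses hypothetical marks (every integer score from 0 to 100 exactly once), not the course data, so the numbers differ from the table.

```python
import pandas as pd

# Hypothetical marks: each integer score from 0 to 100 appears once.
marks = pd.Series(range(101))

# Quantile levels matching the columns min, 10%, 25%, 50%, 75%, 90%, max.
levels = [0, 0.10, 0.25, 0.50, 0.75, 0.90, 1]
quantile_table = marks.quantile(levels)
print(quantile_table)

# Share of students scoring at or below the 10% quantile: close to 0.10.
share_below_q10 = (marks <= quantile_table.loc[0.10]).mean()
print(round(share_below_q10, 3))
```

Reading the table this way makes the reasoning in the answer explicit: the value in the 10% column is the score at or below which roughly 10% of the class falls.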


# Quantitative Variables - Summary Metrics




You saw that quartiles are a better measure of the spread than the standard deviation when the data contains outliers. A good way to visualise quartiles is to use a box plot. The 25th percentile is represented by the bottom horizontal line of the box, the 75th is the top line of the box, and the median is the line inside the box.

1. Shares vs Weekend 

![image.png](attachment:image.png)

2. Shares vs Weekday

![image.png](attachment:image.png)

3.  Shares vs Channel type

![image.png](attachment:image.png)

#### Q1: Shares vs Weekend
The variation in the number of shares is larger for:


**Options**:
- Weekends
- Weekdays


**Ans**: Weekends

The height of the box, i.e. the difference between the 75th and the 25th percentiles is larger for weekends.



#### Q2: Shares across Weekdays
Say you want to compare the number of shares of articles across weekdays. Select the correct statement.


**Options**:
- There is a significant difference in the spread of the number of shares across weekdays

- Articles published on Mondays get significantly more popular than those on other days

- There is no significant difference in either the median or the spread of the number of shares across weekdays


**Ans**: There is no significant difference in either the median or the spread of the number of shares across weekdays

The median and both the 25th and 75th percentiles are nearly the same across weekdays, indicating that there is no significant difference in either the median or the amount of spread. Note that the means may differ noticeably because of a few very high values, which is why the mean is a deceptive measure to look at here.


#### Q3: Shares across Channel Types
Which types of articles are most likely to reach 2500 shares?


**Options**:
- Social Media
- Technology
- Lifestyle
- World


**Ans**: 
- Social Media
   - Social media articles are clearly shared more than others: approximately 60% of articles on social media reach 2,500 shares, whereas even the 75th percentile of the other channels barely reaches 2,500 shares.


#### Q1: Mode
The mode of a categorical variable is the value (category) that occurs the most often.

What is the mode of the num_keywords variable in the News Popularity data set?


**Options**:
- 7
- 6
- 10
- 8


**Ans**: 7
Articles with 7 keywords are more in number than articles with any other number of keywords. 7322 articles have 7 keywords.



#### Q2: Mean
What is the average number of times the articles in the data set were shared, i.e. what is the mean of the shares?


**Options**:
- 2427
- 7322
- 3395
- 5000


**Ans**: 3395
The articles were shared approx. 3395 times on average.



#### Q3: Median
What is the median value of the shares?


**Options**:
- 1500
- 1400
- 7322
- 3598


**Ans**: 1400
The median value of the shares is 1400.




#### Q4: Median vs Mean
Why do you think there is a huge difference between the mean and median of the shares? Which of these metrics is more representative of the shares? Write your answer in the text box below. (Word limit: 200)




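**Ans**: The mean and the median differ greatly because of outliers in the data: a small number of articles with very high share counts pull the mean (approx. 3395) well above the median (1400). If those outliers were removed, the difference would be much smaller. The median is therefore the more representative metric for the shares.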




# Summary

Let's summarise what you learnt:

Metadata description describes the data in a structured way. You should make it a habit of creating a metadata description for whatever data set you are working on. Not only will it serve as a reference point for you, it will also help other people understand the data better and save time.

Distribution plots reveal interesting insights about the data. You can observe various visible patterns in the plots and try to understand how they came to be.

Summary metrics are used to obtain a quantitative summary of the data. Not all metrics can be used everywhere. Thus, it is important to understand the data and then choose what metric to use to summarise the data.