In [1]:
pip install mlxtend  

Collecting mlxtend
  Downloading mlxtend-0.23.0-py3-none-any.whl (1.4 MB)

In [2]:
import pandas as pd
import numpy as np

# Scaling modules
from mlxtend.preprocessing import minmax_scaling, standardize

# Plotting modules
import seaborn as sns
import matplotlib.pyplot as plt

# Ensures the same random data is used each time you execute the code
np.random.seed(0) 

In [6]:
# Read in data 
df = pd.read_csv('store_income_data_task.csv')

In [7]:
# Explore data
df.shape

(1000, 7)

In [8]:
df.head(5)

Unnamed: 0,id,store_name,store_email,department,income,date_measured,country
0,1,"Cullen/Frost Bankers, Inc.",,Clothing,$54438554.24,4-2-2006,United States/
1,2,Nordson Corporation,,Tools,$41744177.01,4-1-2006,Britain
2,3,"Stag Industrial, Inc.",,Beauty,$36152340.34,12-9-2003,United States
3,4,FIRST REPUBLIC BANK,ecanadine3@fc2.com,Automotive,$8928350.04,8-5-2006,Britain/
4,5,Mercantile Bank Corporation,,Baby,$33552742.32,21-1-1973,United Kingdom


In [9]:
# Get the number of missing data points per column
missing_values_count = df.isnull().sum()

# Look at the number of missing points in each column
# (the frame has only seven columns, so the slice covers all of them)
missing_values_count[0:10]

id                 0
store_name         0
store_email      587
department        27
income             0
date_measured      0
country           35
dtype: int64

In [10]:
# Total number of missing values
total_cells = np.prod(df.shape)
total_missing = missing_values_count.sum()

# Percent of data that is missing
(total_missing/total_cells) * 100

9.27142857142857
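
The overall percentage hides how unevenly the gaps are spread across columns. A minimal sketch of a per-column breakdown follows; `toy_df` and its values are made up for illustration and are not the store data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the store dataset
toy_df = pd.DataFrame({
    "store_email": ["a@x.com", np.nan, np.nan, "d@x.com"],
    "department": ["Tools", "Baby", np.nan, "Beauty"],
    "country": ["Britain", "SA", "U.K.", "SA"],
})

# isnull() gives a boolean frame; the mean of each column is its missing fraction
percent_missing = toy_df.isnull().mean() * 100

# Columns with the most missing values come first
print(percent_missing.sort_values(ascending=False))
```

Applied to the counts shown earlier (1000 rows), the same calculation gives 58.7% for store_email, 2.7% for department and 3.5% for country.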

In [11]:
# Create a temporary copy so the original dataframe is left untouched
temp_df = df.copy()

# Remove all the rows that contain a missing value.
temp_df.dropna().head()

Unnamed: 0,id,store_name,store_email,department,income,date_measured,country
3,4,FIRST REPUBLIC BANK,ecanadine3@fc2.com,Automotive,$8928350.04,8-5-2006,Britain/
5,6,"Auburn National Bancorporation, Inc.",ccaldeyroux5@dion.ne.jp,Grocery,$69798987.04,19-9-1999,U.K.
6,7,"Interlink Electronics, Inc.",orodenborch6@skyrock.com,Garden,$22521052.79,8-6-2001,SA
9,10,"Synopsys, Inc.",lcancellieri9@tmall.com,Electronics,$44091294.62,11-7-2006,United Kingdom
15,16,New Home Company Inc. (The),nhinchcliffef@whitehouse.gov,Shoes,$90808764.99,21-4-1993,Britain


In [12]:
temp_df.dropna().shape

(384, 7)
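
Dropping every incomplete row keeps only 384 of the 1000 rows, and store_email alone is missing in 587 of them. One possible alternative, sketched here on made-up data (`toy_df` is illustrative, and the choice of required columns is an assumption), is to drop rows only when selected columns are missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the store dataset
toy_df = pd.DataFrame({
    "store_email": ["a@x.com", np.nan, np.nan, "d@x.com"],
    "department": ["Tools", "Baby", np.nan, "Beauty"],
    "country": ["Britain", "SA", "U.K.", "SA"],
})

# Dropping on all columns discards any row with a gap anywhere
strict = toy_df.dropna()

# Dropping on a subset keeps rows whose only gap is in store_email
lenient = toy_df.dropna(subset=["department", "country"])

print(strict.shape, lenient.shape)
```

Whether this trade-off is acceptable depends on which columns the later analysis actually needs.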

Notes on the likely cause of the missing data in three columns: store_email, department and country.


store_email Column:
The absence of data in the store_email column may be attributed to a Missing Completely at Random (MCAR) mechanism.

Explanation: The missing email addresses do not appear to follow any pattern tied to the other columns. Some stores may have withheld their address for privacy reasons, or technical issues during data collection may have dropped it. Because the gaps seem to occur at random across the dataset, independent of the observed variables, they are consistent with MCAR.
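
One rough way to probe this assumption is to compare the share of missing emails across the categories of another column. Roughly equal shares are consistent with MCAR, although they do not prove it. The sketch below uses made-up data; `toy_df` and the "missing"/"present" labels are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the store dataset
toy_df = pd.DataFrame({
    "store_email": ["a@x.com", np.nan, "c@x.com", np.nan, np.nan, "f@x.com"],
    "department": ["Tools", "Tools", "Baby", "Baby", "Beauty", "Beauty"],
})

# Label each row by whether its email is missing
email_status = toy_df["store_email"].isnull().map({True: "missing", False: "present"})

# Row-normalised cross-tabulation: share of missing/present emails per department
rates = pd.crosstab(toy_df["department"], email_status, normalize="index")

print(rates)
```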

department Column:
The missing data in the department column is likely governed by a Missing at Random (MAR) mechanism.

Explanation: The likelihood of a missing department appears to be related to other observed variables, such as income or country. Stores with a particular income level or location may be less likely to place themselves in a specific department. Because the missingness can be explained by variables that are present in the dataset, it is consistent with MAR.
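
A relationship of this kind can be checked by building a missing-value indicator and summarising the observed variables against it. The sketch below uses made-up data; `toy_df`, `income_num` and `department_missing` are illustrative names, and the income strings only mimic the "$..." format seen in the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the store dataset
toy_df = pd.DataFrame({
    "department": ["Tools", np.nan, "Baby", np.nan, "Beauty", "Garden"],
    "income": ["$10.00", "$90.00", "$20.00", "$80.00", "$30.00", "$40.00"],
    "country": ["SA", "Britain", "SA", "Britain", "SA", "Britain"],
})

# Income is stored as text with a leading "$"; strip it to get a number
toy_df["income_num"] = (
    toy_df["income"].str.replace("$", "", regex=False).astype(float)
)

# Indicator: True where department is missing
toy_df["department_missing"] = toy_df["department"].isnull()

# Mean income for rows with and without a department
income_by_missing = toy_df.groupby("department_missing")["income_num"].mean()

# Share of missing departments within each country
missing_rate_by_country = toy_df.groupby("country")["department_missing"].mean()

print(income_by_missing)
print(missing_rate_by_country)
```

Large differences between the groups would point toward MAR; similar values would not support it.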

country Column:
The missing data in the country column may be indicative of a Missing Not at Random (MNAR) mechanism.

Explanation: The probability of a missing country seems to depend on factors that are not recorded in the dataset. For example, stores with certain characteristics may be less likely to disclose where they operate. Because the observed variables do not fully explain the missingness, an MNAR mechanism is possible.