# Lab 1.02 - Android Persistence

Import the necessary Python libraries and load the dataset [android_persistence_cpu.csv](https://github.com/HoGentTIN/dsai-en-labs/blob/main/data/android_persistence_cpu.csv) into a variable named `android_persistence`. See the [code book](https://github.com/HoGentTIN/dsai-en-labs/blob/main/data/android_persistence_cpu.md) for more information on the contents. Note that this file is not stored as a regular CSV file! You may need to tweak the parameters of the import function to load it correctly.

In [1]:
# Importing the necessary packages
import numpy as np                                  # "Scientific computing"
import scipy.stats as stats                         # Statistical tests

import pandas as pd                                 # Data Frame
from pandas.api.types import CategoricalDtype

import matplotlib.pyplot as plt                     # Basic visualisation
from statsmodels.graphics.mosaicplot import mosaic  # Mosaic diagram
import seaborn as sns                               # Advanced data visualisation

In [10]:
android_persistence = pd.read_csv('../data/android_persistence_cpu.csv', sep=";")
android_persistence.head()

   Time    PersistenceType DataSize
0  1.81  Sharedpreferences    Small
1  1.35  Sharedpreferences    Small
2  1.84  Sharedpreferences    Small
3  1.54  Sharedpreferences    Small
4  1.81  Sharedpreferences    Small
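
The solution above passes `sep=";"` explicitly. When the delimiter of a file is not known in advance, pandas can also try to detect it. The following is a minimal sketch of that alternative, assuming a small in-memory stand-in for the file (the text in `raw` is made up for illustration and is not the lab dataset):

```python
import io

import pandas as pd

# Hypothetical stand-in for a semicolon-separated file (not the real data).
raw = io.StringIO(
    "Time;PersistenceType;DataSize\n"
    "1.81;Sharedpreferences;Small\n"
    "1.35;Sharedpreferences;Small\n"
)

# sep=None asks pandas to sniff the delimiter; this needs the Python engine.
sample = pd.read_csv(raw, sep=None, engine="python")
print(sample.columns.tolist())
print(sample.shape)
```

If the columns come out as a single long name such as `Time;PersistenceType;DataSize`, the delimiter was not recognised and should be set by hand.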


Explore the dataset:

- How many variables and observations are present in the dataset?
- What is the level of measurement of each variable?
- Perform the conversion of the qualitative variables to the appropriate type (and specify the order of ordinal variables).
- List the data types in the dataset.

In [11]:
# How many rows does the DataFrame have?
print(f"Number of rows: {len(android_persistence)}")
# How many columns?
print(f"Number of columns: {len(android_persistence.columns)}")
# How many rows and columns, i.e. the shape
print(f"The shape of the Data Frame is: {android_persistence.shape}")
# General information about the DataFrame
print("*"*50)
android_persistence.info()

# Give the data type of each column.
print("*"*50)
print(android_persistence.dtypes)

# How many columns of each data type are there?
#   Note: the book uses get_dtype_counts(), but that method has been removed
#   from pandas; dtypes.value_counts() gives the same information.
print("*"*50)
print(android_persistence.dtypes.value_counts())

Number of rows: 300
Number of columns: 3
The shape of the Data Frame is: (300, 3)
**************************************************
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300 entries, 0 to 299
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Time             300 non-null    float64
 1   PersistenceType  300 non-null    object 
 2   DataSize         300 non-null    object 
dtypes: float64(1), object(2)
memory usage: 7.2+ KB
**************************************************
Time               float64
PersistenceType     object
DataSize            object
dtype: object
**************************************************
object     2
float64    1
Name: count, dtype: int64


In [14]:
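# Inspect the distinct values (only shown if this is the last expression in the cell)
android_persistence.PersistenceType.unique()
# PersistenceType is nominal: the category list fixes the display order only,
# so the type is not marked as ordered.
persistence_type_categorical = CategoricalDtype(['Sharedpreferences', 'GreenDAO', 'SQLLite', 'Realm'], ordered=False)
android_persistence.PersistenceType = android_persistence.PersistenceType.astype(persistence_type_categorical)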

In [16]:
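# Inspect the distinct values (only shown if this is the last expression in the cell)
android_persistence.DataSize.unique()
# DataSize is ordinal: Small < Medium < Large
datasize_categorical = CategoricalDtype(['Small', 'Medium', 'Large'], ordered=True)
android_persistence.DataSize = android_persistence.DataSize.astype(datasize_categorical)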
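
To illustrate what declaring an order buys you, here is a minimal sketch on toy values (the series `sizes` is hypothetical, not the lab data). With an ordered categorical type, minimum, maximum, sorting and comparisons follow the declared category order rather than alphabetical order:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Toy values for illustration only.
sizes = pd.Series(["Medium", "Small", "Large", "Small"])
size_type = CategoricalDtype(["Small", "Medium", "Large"], ordered=True)
sizes = sizes.astype(size_type)

smallest, largest = sizes.min(), sizes.max()   # follow the category order
in_order = sizes.sort_values().tolist()        # sorted Small -> Large
above_small = (sizes > "Small").tolist()       # element-wise comparison

print(smallest, largest)
print(in_order)
print(above_small)
```

Without `ordered=True`, the calls to `min()`, `max()` and the `>` comparison would raise a `TypeError`, which is a useful safeguard for nominal variables.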

Describe each variable.

In [19]:
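# Note: a notebook cell only displays the value of its last expression,
# so only the summary of PersistenceType appears below.
android_persistence['Time'].describe()
android_persistence['DataSize'].describe()
android_persistence['PersistenceType'].describe()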

count          300
unique           4
top       GreenDAO
freq            90
Name: PersistenceType, dtype: object
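
Only the summary of `PersistenceType` is shown above, because a notebook displays just the value of the last expression in a cell. One way to summarise quantitative and qualitative variables together is `describe(include="all")`. A minimal sketch on toy data (the frame `toy` is hypothetical, not the lab dataset):

```python
import pandas as pd

# Toy frame with one numeric and one categorical column (illustration only).
toy = pd.DataFrame({
    "Time": [1.0, 2.0, 3.0, 4.0],
    "DataSize": pd.Categorical(["Small", "Small", "Large", "Small"]),
})

# include="all" summarises every column; statistics that do not apply
# to a column (e.g. the mean of a categorical) are reported as NaN.
summary = toy.describe(include="all")
print(summary)
```

Alternatively, wrapping each `describe()` call in `print()` shows all three summaries from a single cell.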

What unique values are there for the variables `PersistenceType` and `DataSize`? How often does each value occur?

In [27]:
android_persistence.DataSize.value_counts()

DataSize
Small     120
Medium     90
Large      90
Name: count, dtype: int64

In [28]:
android_persistence.PersistenceType.value_counts()


PersistenceType
GreenDAO             90
SQLLite              90
Realm                90
Sharedpreferences    30
Name: count, dtype: int64
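
By default `value_counts()` sorts by frequency. For a categorical variable it can be more readable to list the counts in category order, or to show relative frequencies. A minimal sketch on toy values (the series `toy` is hypothetical, not the lab data):

```python
import pandas as pd

# Toy ordinal variable for illustration only.
toy = pd.Series(pd.Categorical(
    ["Small", "Large", "Small", "Medium", "Small"],
    categories=["Small", "Medium", "Large"],
    ordered=True,
))

# Absolute frequencies, listed in category order instead of by count.
counts = toy.value_counts().sort_index()

# Relative frequencies (fractions that sum to 1), also in category order.
shares = toy.value_counts(normalize=True).sort_index()

print(counts)
print(shares)
```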

In this dataset, it is especially interesting to know how often each unique combination of `PersistenceType` and `DataSize` occurs. Figure out how to use the pandas function `crosstab()` to create a so-called contingency table for these variables. This concept will return in Module 4 (examining the relationship between 2 qualitative variables).

In [29]:
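# Rows: PersistenceType, columns: DataSize; each cell counts the observations
# with that combination of values.
pd.crosstab(android_persistence.PersistenceType, android_persistence.DataSize)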

DataSize           Small  Medium  Large
PersistenceType                        
Sharedpreferences     30       0      0
GreenDAO              30      30     30
SQLLite               30      30     30
Realm                 30      30     30
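
Two optional parameters of `crosstab()` are often useful with contingency tables: `margins=True` adds row and column totals, and `normalize="index"` converts each row to proportions. A minimal sketch on toy data (the frame `toy` is hypothetical, not the lab dataset):

```python
import pandas as pd

# Toy observations for illustration only.
toy = pd.DataFrame({
    "PersistenceType": ["Realm", "Realm", "GreenDAO", "GreenDAO", "Realm"],
    "DataSize": ["Small", "Large", "Small", "Small", "Large"],
})

# Counts per combination, with an extra "All" row and column holding totals.
table = pd.crosstab(toy["PersistenceType"], toy["DataSize"], margins=True)

# Row-wise proportions: each row sums to 1.
row_shares = pd.crosstab(toy["PersistenceType"], toy["DataSize"], normalize="index")

print(table)
print(row_shares)
```

The totals make it easy to spot unbalanced designs, such as a category that was only measured for one data size.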
