
![](designer.jpeg)

# Daily Note - 17/06/2024

##  Read Rdata files in Python

You can use the `rpy2` library, which provides an interface to R from Python.

First, you need to install R in your environment. On Debian or Ubuntu, you can do this by running the following commands in your terminal:

```bash
sudo apt update
sudo apt install r-base
```

Then, you can install the `rpy2` library by running the following command:

```bash	
pip install rpy2
```

After installing the library, you can read Rdata files in Python using the following code:

```python
import os  # needed for the os.path.exists check below

import pandas as pd
import rpy2.robjects as robjects
from rpy2.robjects import pandas2ri

# Path to the data file
rdata_path = 'absolute_path_to_file/stroke_data.Rdata'

# Ensure the path is correct
assert os.path.exists(rdata_path), f"File not found: {rdata_path}"

# Activate the pandas conversion
pandas2ri.activate()

# Load objects
loaded_objects = robjects.r['load'](rdata_path)
print("Loaded objects:", loaded_objects)

# Access the data variables loaded from R
stroke_train = robjects.r['stroke_train']
stroke_test = robjects.r['stroke_test']
VC_preds = robjects.r['VC_preds']
risk_preds = robjects.r['risk_preds']
riskPredNames = robjects.r['riskPredNames']

# Convert the R data frame to a Pandas DataFrame
df_stroke_train = pandas2ri.conversion.rpy2py(stroke_train)
df_stroke_test = pandas2ri.conversion.rpy2py(stroke_test)
df_VC_preds = pandas2ri.conversion.rpy2py(VC_preds)
df_risk_preds = pandas2ri.conversion.rpy2py(risk_preds)
df_riskPredNames = pandas2ri.conversion.rpy2py(riskPredNames)
```	


## Seaborn - Multiple hue levels in histograms

When you plot histograms with `sns.histplot` and the `hue` parameter, the `multiple` parameter controls how the different hue levels are drawn relative to each other. By default (`multiple='layer'`), the histograms for each hue level overlap: the bars are semi-transparent and layered on top of each other.

When you use `multiple='stack'`, the histograms for each `hue` level are stacked on top of each other. This makes it easier to compare the total counts across bins and see the proportion of each hue level within each bin.

## Difference between StratifiedShuffleSplit and StratifiedKFold

In a multi-class classification problem, preserving the class distribution is important in your cross-validation strategy.

`StratifiedKFold` is a variation of `KFold` which divides the dataset into `k` folds, ensuring that each fold has approximately the same class proportions as the full dataset. This helps to maintain the balance of classes in each fold, which is important when you have imbalanced classes.

`StratifiedShuffleSplit` also preserves the class distribution, but instead of splitting the dataset into `k` folds, it creates several random splits, each preserving the class distribution. It shuffles the data and then splits it into training and validation sets while maintaining the balance of classes.

If you want to ensure that every instance is used for validation exactly once (and for training in the remaining `k - 1` folds), you should use `StratifiedKFold`.

![StratifiedKFold](stkf_image.png)

If you want to create multiple random splits of the dataset, you should use `StratifiedShuffleSplit`.

![StratifiedShuffleSplit](stss_image.png)

If you have a large dataset, `StratifiedKFold` is a good choice. If you have a small dataset and you want multiple random training/validation splits without the constraints of k-fold (a validation size fixed at 1/k and non-overlapping validation sets), `StratifiedShuffleSplit` could be more appropriate.
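The difference can be checked with a small sketch that counts how often each sample lands in a validation set. The labels are synthetic (80 samples of class 0 and 20 of class 1), and the helper `validation_counts` is a hypothetical name used only for this example.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit

# Synthetic imbalanced labels (hypothetical): 80 samples of class 0, 20 of class 1
y = np.array([0] * 80 + [1] * 20)
X = np.zeros((len(y), 1))  # the feature values do not matter for the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0)


def validation_counts(splitter):
    """Count how many times each sample is placed in a validation set."""
    counts = np.zeros(len(y), dtype=int)
    for _, val_idx in splitter.split(X, y):
        counts[val_idx] += 1
    return counts


skf_counts = validation_counts(skf)
sss_counts = validation_counts(sss)

# Share of class 1 in every validation set
skf_ratios = [y[val_idx].mean() for _, val_idx in skf.split(X, y)]
sss_ratios = [y[val_idx].mean() for _, val_idx in sss.split(X, y)]

print("StratifiedKFold validation counts:", np.unique(skf_counts))
print("StratifiedShuffleSplit validation counts:", np.unique(sss_counts))
print("Class-1 share per validation set:", skf_ratios, sss_ratios)
```

With `StratifiedKFold` every count is exactly 1. With `StratifiedShuffleSplit` the validation sets are drawn independently, so a sample can appear in several of them or in none. Both keep the 20% share of class 1 in each validation set.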


## Reproducibility in Machine Learning

Reproducibility is a key aspect of machine learning projects because it helps you test hypotheses. When you control the randomness of your model, you can attribute a change in score to a change in your features or hyperparameters rather than to chance.
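A minimal sketch of controlling randomness, assuming only the Python and NumPy global random generators are involved. `set_seed` and `noisy_score` are hypothetical helpers for illustration. Frameworks with their own generators, and scikit-learn estimators that take `random_state`, would need to be seeded as well.

```python
import random

import numpy as np


def set_seed(seed: int) -> None:
    """Seed the Python and NumPy global random number generators."""
    random.seed(seed)
    np.random.seed(seed)


def noisy_score() -> float:
    """Stand-in for a training run whose result depends on random draws."""
    return float(np.random.normal(loc=0.8, scale=0.05)) + random.random() * 0.01


set_seed(42)
first = noisy_score()

set_seed(42)
second = noisy_score()  # same seed, same result

set_seed(7)
third = noisy_score()   # different seed, different result

print(first == second)
print(first == third)
```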

### Same random seed in Kaggle competitions

If everyone in a Kaggle competition uses the same random seed, most of the submissions will end up being quite similar, and the results will tend to overfit the public leaderboard.

### Set up different seed

Your goal in a Kaggle competition is to get a good score on the private leaderboard, so you should always try to do things differently from everyone else. Setting up different seeds is a good way to do that.

## Design a good cross-validation strategy

You need to get a good validation strategy as soon as you can because the public leaderboard in most Kaggle competitions is misleading.

Most Kaggle competitions divide the test set into a public and a private part. The public part (often a small fraction of the test set) is used to calculate the public leaderboard, while the private part is used to calculate the final score. The split between the public and private parts is random, and we have no information about which test instances belong to which part.

Even assuming that the public part has the same distribution as the private part, the small number of instances in either part can make it difficult to correlate the public leaderboard results with the private leaderboard results.

Some Kagglers recommend treating the public test set as an additional holdout set in order to avoid overfitting to the public leaderboard, which is known as adaptive overfitting.



## Categorical variables

Simple categorical variables can also be classified as ordered or unordered.

A variable with values “Bad”, “Good”, and “Better” shows a clear progression of values. While the difference between these categories may not be precisely numerically quantifiable, there is a meaningful ordering. To contrast, consider another variable that takes values of “French”, “Indian”, or “Peruvian”. These categories have no meaningful ordering.
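In pandas, this distinction can be sketched with the `ordered` flag of a categorical. The variable names below are illustrative, and the values are the ones from the text.

```python
import pandas as pd

# Ordered categorical: the category order is declared explicitly
quality = pd.Series(pd.Categorical(
    ["Good", "Bad", "Better", "Good"],
    categories=["Bad", "Good", "Better"],
    ordered=True,
))

# Unordered categorical: no meaningful ordering between the values
cuisine = pd.Series(pd.Categorical(["French", "Indian", "Peruvian", "Indian"]))

print(quality.cat.codes.tolist())  # integer codes follow the declared order
print((quality > "Bad").tolist())  # comparisons are allowed for ordered categories
print(quality.max())               # the highest category in the declared order
print(cuisine.cat.ordered)         # False: comparisons such as `>` would raise a TypeError
```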

A large majority of models require that all predictors be numeric. There are, however, some exceptions. Algorithms for tree-based models can naturally handle splitting numeric or categorical predictors: these algorithms employ a series of if/then statements that sequentially split the data into groups.

Similarly, a naive Bayes model can create a cross-tabulation between a categorical predictor and the outcome class, and this frequency distribution is factored into the model’s probability calculations.

The OkCupid data set, for example, contains a large number of potential categorical predictors, many of which have a low prevalence.


The most basic approach to representing categorical values as numeric data is to create dummy or indicator variables.

The mathematical function required to make the translation is often referred to as a contrast or parameterization function. An example of a contrast function is called the “reference cell” or “treatment” contrast, where one of the values of the predictor is left unaccounted for in the resulting dummy variables. For a day-of-the-week predictor with Sunday as the reference cell, the contrast function would create six dummy variables, one for each of the remaining days.
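A small sketch of the reference cell contrast with pandas, assuming a day-of-the-week column. Declaring Sunday as the first category makes `drop_first=True` drop it. The frame name `df_days` is illustrative.

```python
import pandas as pd

days = ["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"]

# One row per day; Sunday is the first category, so it becomes the reference cell
df_days = pd.DataFrame({"day": pd.Categorical(days, categories=days)})

dummies = pd.get_dummies(df_days, columns=["day"], drop_first=True, dtype=int)

print(dummies.columns.tolist())  # six columns, none for Sunday
print(dummies.iloc[0].tolist())  # the Sunday row is all zeros
```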

Why only six? There are two related reasons. First, if the values of the six dummy variables are known, then the seventh can be directly inferred.

The second reason is more technical. When fitting linear models, the design matrix X is created. When the model has an intercept, an additional initial column of ones for all rows is included. Estimating the parameters for a linear model (as well as other similar models) involves inverting the matrix (X′X). If the model includes an intercept and contains dummy variables for all seven days, then the seven day columns would add up (row-wise) to the intercept column, and this linear dependence would prevent the matrix inverse from being computed (as the matrix is singular). When this occurs, the design matrix is said to be less than full rank or overdetermined. When there are C possible values of the predictor and only C − 1 dummy variables are used, the matrix inverse can be computed and the contrast method is said to be a full rank parameterization.
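The rank argument can be checked numerically. This is a sketch with a made-up design matrix covering two weeks of observations. Only the structure of the columns matters here, not the outcome values.

```python
import numpy as np

n_days = 7
day_index = np.arange(14) % n_days       # two weeks of observations (hypothetical)
all_dummies = np.eye(n_days)[day_index]  # 14 x 7, one indicator column per day
intercept = np.ones((len(day_index), 1))

X_full = np.hstack([intercept, all_dummies])        # intercept + 7 dummy variables
X_ref = np.hstack([intercept, all_dummies[:, 1:]])  # intercept + 6 dummy variables

# The seven day columns sum to the intercept column, so X_full loses one rank
print(X_full.shape[1], np.linalg.matrix_rank(X_full))  # 8 columns, rank 7
print(X_ref.shape[1], np.linalg.matrix_rank(X_ref))    # 7 columns, rank 7
```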


What is the interpretation of the dummy variables? That depends on what type of
model is being used.

Consider a linear model for the Chicago transit data that only
uses the day of the week in the model with the reference cell parameterization above.
Using the training set to fit the model, the intercept value estimates the mean of
the reference cell, which is the average number of Sunday riders in the training set,
and was estimated to be 3.84K people. The second model parameter, for Monday, is
estimated to be 12.61K. In the reference cell model, the dummy variables represent
the mean value above and beyond the reference cell mean. In this case, the estimate
indicates that there were 12.61K more riders on Monday than Sunday. The overall
estimate of Monday ridership adds the estimates from the intercept and dummy
variable (16.45K rides).
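Written out, with the intercept as the Sunday mean and the Monday coefficient as the increment over Sunday (the symbols are notation introduced here, the numbers are the ones quoted above):

$$
\hat{y}_{\text{Monday}} = \hat{\beta}_0 + \hat{\beta}_{\text{Monday}} = 3.84\text{K} + 12.61\text{K} = 16.45\text{K}
$$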


When there is more than one categorical predictor, the reference cell becomes multidimensional. Suppose there was a predictor for the weather that has only a few values: “clear”, “cloudy”, “rain”, and “snow”. With “clear” as the reference cell, the intercept would correspond to the mean for clear Sundays, but the interpretation of each set of dummy variables does not change. The average ridership for a cloudy Monday would augment the average clear Sunday ridership with the average incremental effect of cloudy and the average incremental effect of Monday.

### Encoding predictors

If there are C categories, what happens when C becomes very large? For example, ZIP Code in the United States may be an important predictor for outcomes that are affected by a geographic component. There are more than 40K possible ZIP Codes and, depending on how the data are collected, this might produce an overabundance of dummy variables. As mentioned above, this can cause the data matrix to be overdetermined and restrict the use of certain models. Also, ZIP Codes in highly populated areas may have a higher rate of occurrence in the data, leading to a “long tail” of locations that are infrequently observed.

```python
import pandas as pd

# Sample data
data = {
    'Category': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'D', 'D', 'E']
}
df = pd.DataFrame(data)
```

```python
# Calculate the frequency of each category
freq = df['Category'].value_counts()
freq
```

```
Category
A    9
C    3
B    2
D    2
E    1
Name: count, dtype: int64
```

```python
# Calculate the ratio of the most common value to the second most common value
most_common = freq.iloc[0]
second_most_common = freq.iloc[1] if len(freq) > 1 else 0
ratio = most_common / second_most_common if second_most_common > 0 else most_common

print(f"Most common: {most_common}, Second most common: {second_most_common}, Ratio: {ratio}")
```

```
Most common: 9, Second most common: 3, Ratio: 3.0
```

```python
# Convert a categorical variable to dummy variables (category A is dropped as the reference cell)
df_dummy = pd.get_dummies(df, columns=['Category'], drop_first=True)
df_dummy
```

First 10 rows of the output:

|   | Category_B | Category_C | Category_D | Category_E |
|---|------------|------------|------------|------------|
| 0 | False | False | False | False |
| 1 | False | False | False | False |
| 2 | False | False | False | False |
| 3 | False | False | False | False |
| 4 | False | False | False | False |
| 5 | False | False | False | False |
| 6 | False | False | False | False |
| 7 | False | False | False | False |
| 8 | False | False | False | False |
| 9 | True | False | False | False |


The ratio of the most common category to the second most common category helps identify whether any category is overwhelmingly dominant.

For dummy variables, calculating the ratio of ones to zeros highlights whether one value is too rare, potentially making the variable less useful for modeling.

Categories identified as "too rare" may be merged with other categories or handled differently to prevent them from negatively impacting model performance. The threshold of 19 used in the check below corresponds to a 95%/5% split between the two values. It is a heuristic and can be adjusted based on the specific needs and context of the problem.

```python
# Example of evaluating a binary variable (dummy variable)
for col in df_dummy.columns:
    num_ones = df_dummy[col].sum()
    num_zeros = len(df_dummy) - num_ones
    ratio = num_ones / num_zeros if num_zeros > 0 else float('inf')
    if ratio > 19 or ratio < 1/19:
        print(f"Variable {col} is too rare with a ratio of {ratio:.2f}")
```


Although near-zero variance predictors likely contain little valuable predictive information, we may not desire to filter these out. One way to avoid filtering these predictors is to redefine the predictor’s categories prior to creating dummy variables. For example, an “other” category can be created that pools the rarely occurring categories, assuming that such a pooling is sensible.
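A minimal sketch of pooling rare levels into an “other” category before creating dummy variables. The data reuses the sample frequencies from above, and the `min_count` threshold of 3 is an arbitrary choice for illustration.

```python
import pandas as pd

categories = pd.Series(
    ["A"] * 9 + ["B"] * 2 + ["C"] * 3 + ["D"] * 2 + ["E"],
    name="Category",
)

min_count = 3  # hypothetical threshold: levels seen fewer times are pooled

counts = categories.value_counts()
rare_levels = counts[counts < min_count].index

# Keep frequent levels as they are and replace rare ones with "other"
pooled = categories.where(~categories.isin(rare_levels), "other")

print(pooled.value_counts().to_dict())
```

Here B, D, and E are pooled, leaving three levels (A, C, and “other”) to encode instead of five.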


Another way to combine categories is to use a hashing function (or hashes). Hashes are used to map one set of values to another set of values and are typically used in databases and cryptography. The original values are called the keys and are typically mapped to a smaller set of artificial hash values. In our context, a potentially large number of predictor categories are the keys and we would like to represent them using a smaller number of categories (i.e., the hashes). The number of possible hashes is set by the user and, for numerical purposes, is a power of 2.

A hash function takes an input (the key) and returns a fixed-size value that typically appears random. The mapping is deterministic, but it is not unique: because there are fewer hashes than keys, different keys can be assigned the same hash value (a collision).

### Why Use Hashing for Categorical Variables?

- Efficiency: When dealing with a large number of categories, traditional encoding methods (like one-hot encoding) can result in high-dimensional sparse matrices. Hashing reduces the dimensionality.
- Flexibility: Hashing allows you to control the number of unique values (hashes) to which the original categories are mapped.
- Scalability: Useful for high-cardinality categorical variables, where the number of unique categories is very large.

### How Hashing Works

- Input (Keys): The original categorical values.
- Hash Function: A function that maps the input keys to hash values.
- Output (Hashes): The hashed values, which are typically fewer in number than the original keys.

### Steps to Implement Hashing

1. Select a Hash Function: Choose a hash function that maps the input keys to a fixed-size output.
2. Determine Number of Hashes: Choose the number of unique hash values, often a power of 2 (e.g., 16, 32, 64).
3. Apply Hash Function: Map each category to a hash value.
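The three steps can be sketched by hand before reaching for a library. This example assumes MD5 from `hashlib` as the hash function and 8 buckets, and `hash_bucket` is a hypothetical helper. It is only an illustration and not how scikit-learn's `FeatureHasher` works internally. Python's built-in `hash()` is avoided because its values for strings change between interpreter runs.

```python
import hashlib


def hash_bucket(key: str, n_buckets: int = 8) -> int:
    """Map a category to one of n_buckets using a deterministic hash."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


categories = list("ABCDEFGHIJ")  # 10 keys
buckets = {category: hash_bucket(category) for category in categories}

print(buckets)
print(len(set(buckets.values())), "distinct buckets for", len(categories), "categories")
```

With 10 categories and only 8 buckets, at least two categories must share a bucket. That collision is the price paid for the reduced dimensionality.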

```python
data = {
    'Category': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
}
df = pd.DataFrame(data)
df
```

First 10 rows of the output:

|   | Category |
|---|----------|
| 0 | A |
| 1 | B |
| 2 | C |
| 3 | D |
| 4 | E |
| 5 | F |
| 6 | G |
| 7 | H |
| 8 | I |
| 9 | J |

```python
df.value_counts()
```

```
Category
A           2
B           2
C           2
D           2
E           2
F           2
G           2
H           2
I           2
J           2
Name: count, dtype: int64
```

```python
from sklearn.feature_extraction import FeatureHasher

# Convert each category into a list of tokens
df['Category_list'] = df['Category'].apply(lambda x: [x])

# Initialize FeatureHasher
hasher = FeatureHasher(n_features=8, input_type='string')
```

```python
df['Category_list']
```

```
0     [A]
1     [B]
2     [C]
3     [D]
4     [E]
5     [F]
6     [G]
7     [H]
8     [I]
9     [J]
10    [A]
11    [B]
12    [C]
13    [D]
14    [E]
15    [F]
16    [G]
17    [H]
18    [I]
19    [J]
Name: Category_list, dtype: object
```

```python
# Apply hashing to the categorical variable
hashed_features = hasher.transform(df['Category_list'])
hashed_features
```

```
<20x8 sparse matrix of type '<class 'numpy.float64'>'
	with 20 stored elements in Compressed Sparse Row format>
```

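```python
# Convert hashed features to a DataFrame for better visualization
hashed_df = pd.DataFrame(hashed_features.toarray(), columns=[f'Hash_{i}' for i in range(8)])
hashed_df
```

First 10 rows of the output:

|   | Hash_0 | Hash_1 | Hash_2 | Hash_3 | Hash_4 | Hash_5 | Hash_6 | Hash_7 |
|---|--------|--------|--------|--------|--------|--------|--------|--------|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -1.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -1.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 5 | 0.0 | 0.0 | -1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 8 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 9 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |

```python
# Concatenate the original DataFrame with the hashed features
df = pd.concat([df, hashed_df], axis=1)
df
```

First 10 rows of the output:

|   | Category | Category_list | Hash_0 | Hash_1 | Hash_2 | Hash_3 | Hash_4 | Hash_5 | Hash_6 | Hash_7 |
|---|----------|---------------|--------|--------|--------|--------|--------|--------|--------|--------|
| 0 | A | [A] | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | B | [B] | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -1.0 | 0.0 |
| 2 | C | [C] | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -1.0 | 0.0 | 0.0 |
| 3 | D | [D] | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | E | [E] | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 5 | F | [F] | 0.0 | 0.0 | -1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | G | [G] | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7 | H | [H] | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 8 | I | [I] | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 9 | J | [J] | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |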