# Dealing With Categorical Variables 

Categorical variables represent data with a finite set of distinct categories or labels. They typically describe qualities or characteristics whose values have no meaningful numerical relationship, unlike continuous variables. Each value a categorical variable takes represents a different category or group.

There are two main types of categorical variables:

- Nominal variables: These represent categories without any intrinsic order or ranking. Examples include:
    - Gender (Male, Female, Other)
    - Color (Red, Blue, Green)
    - Country (USA, Canada, UK)

- Ordinal variables: These represent categories with a meaningful order or ranking, but the intervals between the categories may not be equal. Examples include:
    - Education level (High School, Bachelor's, Master's, PhD)
    - Rating scale (Poor, Fair, Good, Excellent)
    - Class levels (Freshman, Sophomore, Junior, Senior)
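
In pandas, the nominal/ordinal distinction can be made explicit with the `category` dtype. The following is a minimal sketch that reuses the example labels above; the variable names `colors` and `ratings` are illustrative and are not part of the dataset used later.

```python
import pandas as pd

# Nominal: categories with no order
colors = pd.Series(["Red", "Blue", "Green", "Blue"], dtype="category")

# Ordinal: categories with an explicit order, lowest to highest
ratings = pd.Series(
    pd.Categorical(
        ["Good", "Poor", "Excellent", "Fair"],
        categories=["Poor", "Fair", "Good", "Excellent"],
        ordered=True,
    )
)

print(colors.cat.ordered)            # False
print(ratings.cat.ordered)           # True
print(ratings.min(), ratings.max())  # Poor Excellent
print((ratings > "Fair").tolist())   # [True, False, True, False]
```

Comparisons such as `>` and `min()` only make sense for the ordered series; pandas raises an error if you try them on an unordered categorical.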

In [203]:
# imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns 
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression


## Load the data set 

In [204]:
df = pd.read_csv("./data/auto-mpg.csv")
df.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin,car name
0,18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150,3436,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150,3433,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140,3449,10.5,70,1,ford torino


## Feature Engineering 

In [205]:
# create a new column called make
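
One possible way to fill in this cell is sketched below. It assumes that the make is simply the first word of `car name`, which holds for the five rows shown in `df.head()` but may need cleaning for the full dataset (misspellings, multi-word makes). The small frame here is a stand-in built from those five rows so the sketch runs on its own.

```python
import pandas as pd

# Stand-in for the notebook's df: the five car names shown in df.head()
df = pd.DataFrame({
    "car name": [
        "chevrolet chevelle malibu",
        "buick skylark 320",
        "plymouth satellite",
        "amc rebel sst",
        "ford torino",
    ]
})

# Assumption: the make is the first whitespace-separated word of the name
df["make"] = df["car name"].str.split().str[0]
print(df)
```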



## info

Use `info` to identify categorical variables.

In [206]:
# info
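
A minimal sketch of this step, using a three-row stand-in frame rather than the real file. `df.info()` prints each column's dtype, and `select_dtypes` offers a programmatic way to list the non-numeric columns. Note that this only finds string-typed categoricals; categories stored as numbers will still look numeric here.

```python
import pandas as pd

# Stand-in frame with a few columns from the dataset
df = pd.DataFrame({
    "mpg": [18.0, 15.0, 18.0],
    "cylinders": [8, 8, 8],
    "car name": [
        "chevrolet chevelle malibu",
        "buick skylark 320",
        "plymouth satellite",
    ],
})

# Prints column names, non-null counts and dtypes
df.info()

# Columns that are not numeric are candidates for categorical variables
non_numeric = df.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)  # ['car name']
```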


## Numeric Variables

### Numeric variables can be either continuous or discrete.

Continuous variables correspond to "real numbers" in mathematics and floating-point numbers in code. Essentially, these variables can take any value on the number line, and they are usually written with a decimal point in code.

In [207]:
# example of numeric variable
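
As a small illustration, `mpg` is a continuous variable. The sketch below uses the five `mpg` values shown in `df.head()` so it runs without the data file.

```python
import pandas as pd

# The five mpg values shown in df.head()
mpg = pd.Series([18.0, 15.0, 18.0, 16.0, 17.0], name="mpg")

print(mpg.dtype)   # float64
print(mpg.mean())  # averages are meaningful for continuous variables
```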



Discrete numeric variables typically correspond to "whole numbers" in mathematics, and integers in code. These variables have gaps between their values.

In [208]:
# example of a discrete numeric variable
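
A column such as `cylinders` is a reasonable example of a discrete numeric variable. The values below are made up for illustration and are not taken from the real file.

```python
import pandas as pd

# Made-up cylinder counts: whole numbers with gaps between possible values
cylinders = pd.Series([8, 8, 4, 6, 4, 4], name="cylinders")

print(pd.api.types.is_integer_dtype(cylinders))  # True
print(sorted(int(v) for v in cylinders.unique()))  # [4, 6, 8]
```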


## Categorical Variables

### Categorical variables can actually be strings or numbers.

String categorical variables are fairly obvious from their data type (`object` in pandas). For example, `make` is a categorical variable. It cannot be used directly as a numeric axis in a scatter plot, and it will cause an error if you try to use it in a multiple regression model without additional transformations.

In [209]:
# example of a categorical variable 


In [210]:
# value counts
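
A sketch of counting the categories with `value_counts`. The `make` values here are made up so the example is self-contained; in the notebook you would call this on `df["make"]`.

```python
import pandas as pd

# Made-up makes for illustration
make = pd.Series(
    ["ford", "chevrolet", "ford", "toyota", "ford", "chevrolet"],
    name="make",
)

# Number of rows per category, most frequent first
counts = make.value_counts()
print(counts)
```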


### plotting 

In [211]:
# plotting with pandas' built-in plot method
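
One way to plot a categorical variable with pandas alone is a bar chart of its value counts. This is a sketch on made-up `make` values, not the notebook's actual output.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up makes for illustration
make = pd.Series(
    ["ford", "chevrolet", "ford", "toyota", "ford", "chevrolet"],
    name="make",
)
counts = make.value_counts()

# One bar per category, bar height = number of rows
fig, ax = plt.subplots()
counts.plot(kind="bar", ax=ax)
ax.set_xlabel("make")
ax.set_ylabel("count")
```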


### seaborn

In [212]:
# plotting with seaborn
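
With seaborn, `countplot` counts the rows per category for you, so no separate `value_counts` call is needed. Again a sketch on made-up values:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up makes for illustration
df = pd.DataFrame(
    {"make": ["ford", "chevrolet", "ford", "toyota", "ford", "chevrolet"]}
)

# countplot draws one bar per category with the row count as height
fig, ax = plt.subplots()
sns.countplot(data=df, x="make", ax=ax)
```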


Categorical variables stored as discrete numbers can be harder to spot. For example, `origin` is actually a categorical variable in this dataset, even though it is encoded as a number.

In [213]:
# discrete categorical variable: count the values
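
A quick check for a numerically encoded categorical is to look at how few distinct values it has. The sketch below uses made-up `origin` codes; converting to the `category` dtype is one optional way to mark the column as categorical so it is not treated as a quantity by accident.

```python
import pandas as pd

# Made-up origin codes for illustration
origin = pd.Series([1, 1, 3, 2, 1, 3], name="origin")

# Only a handful of distinct codes: a hint that this is a label, not a quantity
print(origin.value_counts().sort_index())

# Optional: mark the column as categorical explicitly
origin_cat = origin.astype("category")
print(origin_cat.cat.categories.tolist())  # [1, 2, 3]
```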


In [214]:
# what would a difference of 1 mean here: US to Europe, or Europe to Asia?

## Transforming Categorical Variables with One-Hot Encoding

One-hot encoding is a technique used to convert categorical variables into a numerical format that can be used in machine learning models. It is particularly useful for handling nominal categorical variables, which do not have an inherent order or ranking.

In [215]:
# make a copy


In [216]:
# manual encoding
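
Manual one-hot encoding can be sketched as creating one 0/1 column per category on a copy of the data. The tiny frame below is a made-up stand-in; the column naming `origin_<code>` follows the names used later in this notebook.

```python
import pandas as pd

# Made-up stand-in with only the column being encoded
df = pd.DataFrame({"origin": [1, 2, 3, 1]})

# Work on a copy so the original frame is left untouched
df_encoded = df.copy()

# One indicator column per distinct category
for value in sorted(df_encoded["origin"].unique()):
    df_encoded[f"origin_{value}"] = (df_encoded["origin"] == value).astype(int)

# The original numeric code is no longer needed
df_encoded = df_encoded.drop(columns="origin")
print(df_encoded)
```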


## Using get_dummies


In [217]:
# using pd.get_dummies
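
`pd.get_dummies` does the same encoding in one call. A minimal sketch on a made-up frame; `dtype=int` is passed so the indicators are 0/1 integers rather than booleans.

```python
import pandas as pd

# Made-up stand-in with one numeric and one categorical column
df = pd.DataFrame({
    "weight": [3504, 2130, 2295, 3693],
    "origin": [1, 2, 3, 1],
})

# Replace "origin" with one indicator column per category
dummies = pd.get_dummies(df, columns=["origin"], dtype=int)
print(dummies)
```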


In [218]:
# further research: the category_encoders library

## The Dummy Variable Trap

Due to the nature of how dummy variables are created, one variable can be predicted from all of the others. For example, if you know that origin_1 is 0 and origin_2 is 0, then you already know that origin_3 must be 1.

In [219]:
# making predictions
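
The claim above can be checked directly: since every row belongs to exactly one category, the indicators sum to 1, so `origin_3` equals `1 - origin_1 - origin_2`. A sketch on made-up codes:

```python
import pandas as pd

# Made-up origin codes for illustration
origin = pd.Series([1, 2, 3, 1, 3], name="origin")
dummies = pd.get_dummies(origin, prefix="origin", dtype=int)

# "Predict" origin_3 from the other two indicator columns
predicted = 1 - dummies["origin_1"] - dummies["origin_2"]

print((predicted == dummies["origin_3"]).all())  # True
```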


This is known as perfect multicollinearity, and it can be a problem for regression. Multicollinearity will be covered in depth later, but the basic idea behind perfect multicollinearity is that you can perfectly predict one variable using some combination of the other variables.

When features in a linear regression have perfect multicollinearity due to the algorithm for creating dummy variables, this is known as the dummy variable trap.

### Avoiding the dummy variable trap 

In [220]:
# avoiding the dummy variable trap
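
A common way to avoid the trap is to drop one indicator column, which then acts as the reference category. With `pd.get_dummies` this can be done with `drop_first=True`, as in this sketch on made-up codes:

```python
import pandas as pd

# Made-up stand-in with only the column being encoded
df = pd.DataFrame({"origin": [1, 2, 3, 1]})

# drop_first=True omits the first category (origin_1), which becomes the baseline
dummies = pd.get_dummies(df, columns=["origin"], drop_first=True, dtype=int)
print(dummies)
```

A row with `origin_2 == 0` and `origin_3 == 0` is then understood to belong to the dropped category.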


## Multiple Regression with One-Hot Encoded Variables

In [221]:
# defining our x and y
y = df["mpg"]
X = df[["weight", "model year", "origin"]]
X

Unnamed: 0,weight,model year,origin
0,3504,70,1
1,3693,70,1
2,3436,70,1
3,3433,70,1
4,3449,70,1
...,...,...,...
387,2790,82,1
388,2130,82,2
389,2295,82,1
390,2625,82,1


### One-hot encoding

In [222]:
# one hot encoding 



## Modeling 

Using statsmodels.

In [223]:
# ols model


### Plotting

In [224]:
# plotting with sm.graphics (regression plots)


## scikit-learn

In [225]:
# in scikit-learn
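
A sketch of the scikit-learn version of the model, again on synthetic stand-in data with the notebook's column names. Unlike statsmodels, `LinearRegression` fits an intercept by default, so no constant column is added.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (NOT the real auto-mpg data)
rng = np.random.default_rng(0)
n = 60
X = pd.DataFrame({
    "weight": rng.uniform(1800, 4500, n),
    "model year": rng.integers(70, 83, n),
    "origin": np.tile([1, 2, 3], n // 3),
})
y = (
    45 - 0.006 * X["weight"] + 0.7 * (X["model year"] - 70) + rng.normal(0, 1, n)
).rename("mpg")

# Same encoding as for the statsmodels model
X_encoded = pd.get_dummies(X, columns=["origin"], drop_first=True, dtype=int)

# The intercept is stored separately from the coefficients
sk_model = LinearRegression().fit(X_encoded, y)
print(sk_model.intercept_)
print(dict(zip(X_encoded.columns, sk_model.coef_)))
```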


### Compare results

In [226]:
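def print_results(sk_model, ols_model):
    # Print the intercepts and coefficients of both fitted models.
    # Assumes ols_model was fitted with sm.add_constant, so params has a "const" entry.
    print(f"""
StatsModels intercept:    {ols_model.params["const"]}
scikit-learn intercept:   {sk_model.intercept_}

StatsModels coefficients:\n{ols_model.params}
scikit-learn coefficients: {sk_model.coef_}
""")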

### Assignment

In [227]:
# build a model with make as a variable (guide below)

# https://moringa.instructure.com/courses/895/assignments/63678?module_item_id=144477