Handling Categorical Data
---

* Use the OrdinalEncoder class

            from sklearn.preprocessing import OrdinalEncoder
            ordinal_encoder = OrdinalEncoder()
            dataset_encoded = ordinal_encoder.fit_transform(dataset)
            ordinal_encoder.categories_  # learned categories, one array per column
* Use one-hot encoding

            from sklearn.preprocessing import OneHotEncoder
            cat_encoder = OneHotEncoder()
            dataset_encoded = cat_encoder.fit_transform(dataset)
            cat_encoder.categories_
   * Notice that the output is a SciPy sparse matrix rather than a NumPy array. A sparse matrix stores only the locations of the nonzero elements, which saves memory when most entries are zero.
   You can use it mostly like a normal 2D array, but if you really want to convert it to a (dense) NumPy array, just call the **toarray()** method.
            
            
* Use an embedding
  If a categorical attribute has a large number of possible categories (e.g., country code, profession, species), then one-hot encoding will result in a large number of input features. This may slow down training and degrade performance.
  Alternatively, you could replace each category with a learnable, low-dimensional vector called an embedding.
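
To make the first two encoders concrete, here is a minimal sketch on a tiny made-up dataset. The category values are hypothetical; only `OrdinalEncoder`, `OneHotEncoder`, `categories_`, and `toarray()` come from the notes above.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical single-column categorical dataset (values are made up)
dataset = np.array([["INLAND"], ["NEAR BAY"], ["INLAND"], ["ISLAND"]])

# Ordinal encoding: one integer per category, in sorted category order
ordinal_encoder = OrdinalEncoder()
ordinal_encoded = ordinal_encoder.fit_transform(dataset)
print(ordinal_encoder.categories_)
print(ordinal_encoded.ravel())

# One-hot encoding: one binary column per category, returned as a sparse matrix
cat_encoder = OneHotEncoder()
onehot_sparse = cat_encoder.fit_transform(dataset)
onehot_dense = onehot_sparse.toarray()  # convert to a dense NumPy array
print(onehot_dense)
```

Each row of the dense one-hot output contains exactly one 1, in the column of that row's category.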
   
   
Custom Transformer
---
You will want your transformer to work seamlessly with Scikit-Learn functionalities (such as pipelines). Since Scikit-Learn relies on duck typing (not inheritance), all you need to do is create a class and implement three methods: **fit() (returning self)**, **transform()**, and **fit_transform()**.
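
A minimal sketch of such a transformer follows. `RatioAdder` and its parameters are hypothetical names invented for illustration. Inheritance is not required, but the sketch uses `TransformerMixin` to get `fit_transform()` without writing it, and `BaseEstimator` to get `get_params()` and `set_params()`.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class RatioAdder(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: appends column[num_ix] / column[den_ix]."""

    def __init__(self, num_ix=0, den_ix=1):
        self.num_ix = num_ix
        self.den_ix = den_ix

    def fit(self, X, y=None):
        return self  # nothing to learn, but fit() must return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        ratio = X[:, self.num_ix] / X[:, self.den_ix]
        return np.c_[X, ratio]  # original columns plus the new ratio column


X = np.array([[6.0, 2.0], [9.0, 3.0], [8.0, 4.0]])
X_extra = RatioAdder().fit_transform(X)  # fit_transform() comes from TransformerMixin
print(X_extra)

# The custom transformer can be used as a pipeline step like any built-in one
pipeline = Pipeline([("ratio", RatioAdder()), ("scaler", StandardScaler())])
X_prepared = pipeline.fit_transform(X)
```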

Feature Scaling
---
* Min-max scaling:

    Values are shifted and rescaled so they end up in the range 0 to 1.

        from sklearn.preprocessing import MinMaxScaler

    The *feature_range* hyperparameter lets you change the range.
* Standardization:

    Subtract the mean, then divide by the standard deviation, so the result has mean 0 and variance 1.

    Standardization does not bound values to a specific range, which may be a problem for some algorithms, but it is much less affected by outliers than min-max scaling.

        from sklearn.preprocessing import StandardScaler

### **Fit any transformer only on training data then transform training/test data**      
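
The rule can be sketched as follows on synthetic data (the numbers are randomly generated, not from any real dataset): each scaler learns its statistics from the training set only, and the same fitted scaler is then applied to the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 2))  # synthetic data
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

std_scaler = StandardScaler()
X_train_std = std_scaler.fit_transform(X_train)  # learn mean/std from training data only
X_test_std = std_scaler.transform(X_test)        # reuse the training statistics

minmax_scaler = MinMaxScaler(feature_range=(0, 1))
X_train_mm = minmax_scaler.fit_transform(X_train)
X_test_mm = minmax_scaler.transform(X_test)      # test values may fall outside [0, 1]

print(X_train_std.mean(axis=0), X_train_std.std(axis=0))
print(X_train_mm.min(axis=0), X_train_mm.max(axis=0))
```

The training set ends up with mean 0 and standard deviation 1 (or range 0 to 1), while the test set only approximately does, which is expected.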

---
Pipeline
---
Scikit-Learn provides the Pipeline class to help with such sequences of transformations.
        
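        from sklearn.pipeline import Pipeline
        from sklearn.impute import SimpleImputer
        from sklearn.preprocessing import StandardScaler

        # CombinedAttributesAdder is assumed to be a custom transformer defined elsewhere
        num_pipeline = Pipeline([
            ('imputer', SimpleImputer(strategy="median")),
            ('attribs_adder', CombinedAttributesAdder()),
            ('std_scaler', StandardScaler()),
        ])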
* **ColumnTransformer**: It is more convenient to have a single transformer that handles all columns, applying the appropriate transformations to each one.

        from sklearn.compose import ColumnTransformer
        # each entry has the form: (name, transformer | "passthrough" | "drop", list of columns)
        full_pipeline = ColumnTransformer([
            ("cat", OneHotEncoder(), cat_attribs),
        ])
This applies each transformer to the appropriate columns and concatenates the outputs along the second axis.
If some transformers return a sparse matrix and others a dense one, the ColumnTransformer estimates the density of the final matrix (i.e., the ratio of nonzero cells) and returns a sparse matrix if the density is lower than a given threshold (by default, sparse_threshold=0.3).
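
Here is a self-contained sketch that combines a numeric pipeline with a one-hot encoder. The data is synthetic, and columns are selected by integer index so that only NumPy is needed; with a DataFrame, `num_attribs` and `cat_attribs` would be lists of column names. The custom attribute-adding step is left out to keep the sketch runnable on its own.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data: columns 0-1 are numeric, column 2 holds a category code (0, 1 or 2)
X = np.array([
    [1.0, 200.0, 0.0],
    [np.nan, 180.0, 1.0],  # missing value, filled in by the imputer
    [3.0, 150.0, 2.0],
    [4.0, 210.0, 0.0],
    [5.0, 190.0, 1.0],
    [6.0, 170.0, 2.0],
])
num_attribs = [0, 1]  # column indices (assumption: no DataFrame is used here)
cat_attribs = [2]

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])

X_prepared = full_pipeline.fit_transform(X)
print(X_prepared.shape)              # 2 scaled numeric columns + 3 one-hot columns
print(full_pipeline.sparse_output_)  # whether the combined output is sparse
```

In this toy case the dense numeric block makes the estimated density exceed the default `sparse_threshold`, so the combined output is a dense array.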



In [None]:
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()

from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor()

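from sklearn.metrics import mean_squared_error
lin_reg.fit(housing_prepared, housing_labels)  # the model must be fitted before predicting
housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)  # MSE on the training set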

from sklearn.ensemble import RandomForestRegressor

**Random Forests** work by training many Decision Trees on random subsets of
the training data and features, then averaging their predictions.
Building a model on top of many other models is called **Ensemble Learning**, and it is often a great way to push ML algorithms
even further.
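
As a small illustration of the averaging step, the sketch below fits a forest on synthetic data and checks that its prediction matches the mean of its individual trees' predictions. The dataset and hyperparameter values are arbitrary choices for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

forest_reg = RandomForestRegressor(n_estimators=20, random_state=42)
forest_reg.fit(X, y)

# Each fitted tree is available in estimators_; average their predictions by hand
tree_predictions = np.array([tree.predict(X[:5]) for tree in forest_reg.estimators_])
manual_average = tree_predictions.mean(axis=0)

# The forest's own prediction should match the manual average
matches = np.allclose(manual_average, forest_reg.predict(X[:5]))
print(matches)
```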


The main ways to fix underfitting are to select a more
powerful model, to feed the training algorithm with better features, or to reduce the
constraints on the model.

Possible solutions for overfitting are
to simplify the model, constrain it (i.e., regularize it), or get a lot more training data.
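
One way to see both problems is to compare training and validation error while changing how constrained a model is. The sketch below uses synthetic data and a Decision Tree with different `max_depth` values; the specific depths are arbitrary assumptions, not recommended settings.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)


def rmse(model, X, y):
    """Root mean squared error of the model's predictions on (X, y)."""
    return np.sqrt(mean_squared_error(y, model.predict(X)))


results = {}
for name, max_depth in [("heavily constrained", 1),
                        ("moderately constrained", 4),
                        ("unconstrained", None)]:
    tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
    tree.fit(X_train, y_train)
    results[name] = (rmse(tree, X_train, y_train), rmse(tree, X_val, y_val))
    print(f"{name}: train RMSE={results[name][0]:.1f}, "
          f"validation RMSE={results[name][1]:.1f}")
```

A high error on both sets suggests underfitting, while a training error near zero with a much larger validation error suggests overfitting.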

Better Evaluation Using Cross-Validation
---

With cv=10, Scikit-Learn’s K-fold cross-validation splits the training set into 10 distinct subsets called folds, then
trains and evaluates the Decision Tree model 10 times, picking a different fold for
evaluation each time and training on the other 9 folds. The result is an array containing
the 10 evaluation scores.

# Notice!!
Scikit-Learn’s cross-validation features expect a utility function
(greater is better) rather than a cost function (lower is better), so
the scoring function is actually the opposite of the MSE (i.e., a negative
value), which is why the code below computes -scores
before calculating the square root.


Cross-validation allows you to get not only an estimate of the performance of your model, but also a measure
of how precise this estimate is (i.e., its standard deviation). The Decision Tree has a
score of approximately 71,407, generally ±2,439. You would not have this information
if you just used one validation set. But cross-validation comes at the cost of training
the model several times, so it is not always possible.

In [None]:
import numpy as np
from sklearn.model_selection import cross_val_score
scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)  # negate the negative MSE, then take the root
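
To read off both the performance estimate and its precision, you can summarize the scores with their mean and standard deviation. The sketch below is self-contained and runs on synthetic data, so its numbers will not match the housing figures quoted above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=15.0, random_state=42)

scores = cross_val_score(DecisionTreeRegressor(random_state=42), X, y,
                         scoring="neg_mean_squared_error", cv=10)
rmse_scores = np.sqrt(-scores)  # scores are negative MSE values, so flip the sign

print("Scores:", rmse_scores)
print("Mean:", rmse_scores.mean())                # performance estimate
print("Standard deviation:", rmse_scores.std())   # how precise the estimate is
```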

# Saving the model
To save your Scikit-Learn model, you can use Python's pickle module or the **joblib library**, which is more efficient at serializing large NumPy arrays.



In [None]:
import joblib
joblib.dump(my_model, "my_model.pkl")
# and later...
my_model_loaded = joblib.load("my_model.pkl")
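
A quick round-trip check can confirm that the reloaded model behaves like the original. This sketch trains a small model on made-up data and writes to a temporary directory; the file name is only an example.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny made-up dataset, just to have a fitted model to save
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])
my_model = LinearRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "my_model.pkl")
    joblib.dump(my_model, model_path)          # save to disk
    my_model_loaded = joblib.load(model_path)  # load it back

# The reloaded model should give the same predictions as the original
same_predictions = np.allclose(my_model.predict(X), my_model_loaded.predict(X))
print(same_predictions)
```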

Fine-tune Model
---

* Grid Search
