# MACHINE LEARNING ALGORITHMS


The Iris dataset is a famous and commonly used dataset in machine learning and statistics. It was introduced by the British biologist and statistician Ronald A. Fisher in 1936. This dataset is often used as a benchmark for various classification algorithms and tasks. It contains measurements of four features of three different species of iris flowers: Iris setosa, Iris versicolor, and Iris virginica. The four features are:

1. **Sepal Length**: The length of the iris flower's sepal (the green, leaf-like structure that surrounds the petals).

2. **Sepal Width**: The width of the iris flower's sepal.

3. **Petal Length**: The length of the iris flower's petal (the colorful part of the flower).

4. **Petal Width**: The width of the iris flower's petal.

Each of these features is measured in centimeters. The goal of using the Iris dataset is typically to classify iris flowers into one of the three species based on these four feature measurements.

Here's how you can load the Iris dataset in Python using scikit-learn:

```python
from sklearn.datasets import load_iris

# Load the Iris dataset
iris = load_iris()

# Access the feature matrix (X) and target labels (y)
X = iris.data  # Feature matrix
y = iris.target  # Target labels (species)
```

After loading the dataset, you'll have access to `X`, which is a NumPy array containing the feature measurements, and `y`, which is a NumPy array containing the target labels (0 for Iris setosa, 1 for Iris versicolor, and 2 for Iris virginica).
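
As a quick sanity check on the loaded arrays, the sketch below counts how many samples belong to each label and pairs the counts with the species names. The variable name `class_counts` is illustrative and not part of scikit-learn.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# 150 samples with 4 features each, and one label per sample
print(X.shape, y.shape)

# Number of samples per class label (0, 1, 2)
class_counts = np.bincount(y)
print(dict(zip(iris.target_names, class_counts)))
```

The dataset is balanced, so plain accuracy is a reasonable first metric for it.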

The Iris dataset is often used for tasks such as classification, clustering, and dimensionality reduction, making it a widely recognized dataset in the field of machine learning for educational and benchmarking purposes.

# IRIS DATASET

In [1]:
from sklearn import svm
from sklearn import datasets

In [2]:
iris=datasets.load_iris()

In [3]:
 type(iris)

sklearn.utils._bunch.Bunch

# DESCRIPTION OF sklearn.utils.Bunch
`sklearn.utils.Bunch` is a class in scikit-learn (often abbreviated as sklearn), which is a popular machine learning library in Python. This class is used to represent datasets in scikit-learn in a convenient and consistent way.

A `Bunch` object is essentially a dictionary-like container that holds various data attributes. It is commonly used to store datasets that are typically used for machine learning tasks. The attributes within a `Bunch` object typically include the following:

1. `data`: This attribute stores the actual data matrix or array, often as a NumPy array or a Pandas DataFrame. Each row of this matrix represents an individual data sample, and each column represents a feature.

2. `target`: This attribute stores the target variable or labels associated with the data samples. It is often a NumPy array or a Pandas Series. In supervised learning tasks, this is the variable you aim to predict or classify.

3. `feature_names`: An optional attribute that contains the names of the features or columns in the `data` attribute. This is useful for keeping track of the feature names when working with datasets.

4. `target_names`: An optional attribute that contains the names or labels for the classes in the `target` attribute. This is commonly used in classification tasks to label the different classes.

5. `DESCR`: An optional attribute that provides a description or information about the dataset.

`Bunch` objects are convenient for packaging and passing datasets because they provide a standardized way to access and manipulate the data and associated metadata. You can access these attributes using dot notation, like `bunch.data` or `bunch.target`, making it easy to work with data in scikit-learn.

Here's an example of creating a `Bunch` object to represent a simple dataset:

```python
from sklearn.utils import Bunch

# Create a Bunch object to represent a dataset
data = Bunch(
    data=[[1, 2], [3, 4], [5, 6]],
    target=[0, 1, 0],
    feature_names=['feature1', 'feature2'],
    target_names=['class0', 'class1'],
    DESCR="A sample dataset for illustration."
)

# Accessing data attributes
print(data.data)
print(data.target)
print(data.feature_names)
print(data.target_names)
print(data.DESCR)
```

In practice, `sklearn.utils.Bunch` is often used when loading and working with datasets from scikit-learn's built-in datasets, making it easier for users to access and manipulate the data in a consistent manner.
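
To make the "dictionary-like" behavior concrete, here is a small sketch using the Iris `Bunch` itself. It shows that key lookup and attribute access reach the same object. The name `same_object` is just an illustrative variable.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# A Bunch behaves like a dict: keys can be listed and looked up by name
print(sorted(iris.keys()))

# Attribute access and key access return the same underlying object
same_object = iris.data is iris["data"]
print(same_object)
```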

In [4]:
# printing elements of the iris dataset
iris.data

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3. , 1.4, 0.1],
       [4.3, 3. , 1.1, 0.1],
       [5.8, 4. , 1.2, 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [5.4, 3.9, 1.3, 0.4],
       [5.1, 3.5, 1.4, 0.3],
       [5.7, 3.8, 1.7, 0.3],
       [5.1, 3.8, 1.5, 0.3],
       [5.4, 3.4, 1.7, 0.2],
       [5.1, 3.7, 1.5, 0.4],
       [4.6, 3.6, 1. , 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5.2, 3.5, 1.5, 0.2],
       [5.2, 3.4, 1.4, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [5.4, 3.4, 1.5, 0.4],
       [5.2, 4.1, 1.5, 0.1],
       [5.5, 4.2, 1.4, 0.2],
       [4.9, 3

In [5]:
#getting information about the data
iris.feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [6]:
#getting our values for prediction
iris.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [7]:
#getting the target names
iris.target_names

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

# DATA MODELLING

In [9]:
# extracting the third column (petal length) of the iris dataset
X=iris.data[:,2]

In [10]:
y=iris.target

In [11]:
# train_test_split used to live in sklearn.cross_validation, which has been removed; import it from sklearn.model_selection instead
from sklearn.model_selection import train_test_split, cross_val_score


In [12]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=4)
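
The split above can be checked with a short, self-contained sketch. It repeats the same call and prints the resulting shapes. Note that `X` is still one-dimensional at this point because only a single column was selected.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X = iris.data[:, 2]   # petal length only, shape (150,)
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4
)

# 80% of the 150 samples go to training, 20% to testing
print(X_train.shape, X_test.shape)
```

If class proportions should be preserved in both parts, `train_test_split` also accepts `stratify=y`. That option is not used in this notebook.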

# DATA RESHAPING

In the context of machine learning, data reshaping refers to the process of changing the structure or organization of your dataset. This can involve converting data from one format to another or changing the arrangement of data points to make it suitable for a specific machine learning algorithm or task. Data reshaping is an important step in data preprocessing and can include various operations, depending on the nature of your data and the requirements of your machine learning model. Here are some common scenarios in which data reshaping is necessary:

1. **Changing Data Structure**:
   
   - **Converting from Lists/Arrays to DataFrames**: You might have your data stored in lists or NumPy arrays, and you want to convert it to a Pandas DataFrame for easier manipulation and analysis.
   
   - **Converting from DataFrames to Arrays**: Conversely, you might need to convert a DataFrame into arrays to use it with certain machine learning libraries or algorithms.

2. **Reshaping for Specific Algorithms**:

   - **Flattening Images**: For image classification tasks, you may need to flatten multi-dimensional image data (e.g., converting 2D images into 1D arrays) to use them with traditional machine learning algorithms like logistic regression.

   - **Changing Input Shape**: Some machine learning algorithms, such as Convolutional Neural Networks (CNNs), require specific input shapes (e.g., height x width x channels for images). You may need to reshape your data to match these requirements.

3. **Time Series Data**:

   - **Sliding Windows**: When working with time series data, you might need to create sliding windows of data for input to recurrent neural networks (RNNs) or other sequence models.

4. **One-Hot Encoding**:

   - **Converting Categorical Variables**: Categorical variables often need to be one-hot encoded, where each category becomes a binary column. This reshaping is essential when using categorical data in machine learning models.

5. **Stacking or Merging Data**:

   - **Stacking DataFrames**: You may have data split across multiple DataFrames or files, and you need to stack or merge them together to create a single dataset for analysis.

6. **Reshaping for Cross-Validation**:

   - **Preparing Data for Cross-Validation**: When performing k-fold cross-validation, the data is split into a training part and a validation part once per fold, so the split is repeated k times with a different validation fold each time.

7. **Aggregating Data**:

   - **Aggregating Time Series Data**: In time series analysis, you might need to aggregate data at different time intervals (e.g., daily to monthly) for modeling.

The specific reshaping operation you need to perform depends on your data and the machine learning task you're working on. Libraries like NumPy, Pandas, and scikit-learn provide a wide range of functions and methods to help you reshape your data to suit your needs. The goal of data reshaping is to prepare your data in a format that allows your machine learning model to learn from it effectively.
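
A few of the scenarios above can be sketched with NumPy alone. The arrays below are made-up toy values chosen only to show the shapes involved. They are not taken from the Iris data.

```python
import numpy as np

# A single feature stored as a 1-D array of 5 samples
feature = np.array([1.4, 1.3, 1.5, 1.7, 1.4])
column = feature.reshape(-1, 1)   # -1 lets NumPy infer the row count
print(feature.shape, column.shape)

# Flatten a batch of two 3x3 "images" into rows of 9 values
images = np.arange(18).reshape(2, 3, 3)
flat = images.reshape(len(images), -1)
print(flat.shape)

# One-hot encode integer class labels by indexing an identity matrix
labels = np.array([0, 2, 1, 2])
one_hot = np.eye(3, dtype=int)[labels]
print(one_hot)
```

The `reshape(-1, 1)` pattern is the one this notebook relies on next, because scikit-learn estimators expect a 2-D feature matrix even when there is only one feature.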

In [13]:
X_train_mod=X_train.reshape(-1,1)

In [14]:
X_test_mod=X_test.reshape(-1,1)

In [15]:
model=svm.SVC(kernel='linear')

In [16]:
y_train_mod=y_train.reshape(-1,1)

In [17]:
y_test_mod=y_test.reshape(-1,1)

The cells above create a Support Vector Machine (SVM) classifier using scikit-learn's `SVC` class with a linear kernel. Here's a breakdown of what this code does:

1. Import scikit-learn's SVM module:

   Before using the `SVC` class, make sure to import it:

   ```python
   from sklearn import svm
   ```

2. Create an SVM classifier with a linear kernel:

   The code above creates an SVM classifier object called `model` with a linear kernel. The linear kernel is one of the kernel functions used in SVM, and it is suitable for linearly separable datasets. In simple terms, it's used when you expect your data to have a linear boundary that separates different classes.

   ```python
   model = svm.SVC(kernel='linear')
   ```

   You can replace `'linear'` with other kernel types like `'rbf'` (Radial Basis Function), `'poly'` (Polynomial), etc., depending on the characteristics of your data and the problem you are trying to solve. Each kernel type has its own advantages and may perform better on different types of data.

3. Train the SVM model:

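   To train the SVM model, pass your training data (feature matrix `X_train` and target labels `y_train`) to `fit`. scikit-learn expects the feature matrix to be 2-D with shape `(n_samples, n_features)`, which is why this notebook reshapes its single-feature `X_train` into `X_train_mod` before fitting. In the general case: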

   ```python
   model.fit(X_train, y_train)
   ```

   - `X_train`: Your training set features.
   - `y_train`: Your training set target labels.

4. Make predictions with the trained model:

   After training, you can use the trained `model` to make predictions on new, unseen data (feature matrix `X_test`). Here's how you can do it:

   ```python
   y_pred = model.predict(X_test)
   ```

   - `X_test`: Your testing set features.
   - `y_pred`: The predicted labels for your testing set based on the trained model.

This code snippet sets up an SVM classifier with a linear kernel, but for practical use, you would typically load your own dataset, split it into training and testing sets, and then train and evaluate the model using your data.
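
The four steps can be combined into one runnable sketch. As an assumption for illustration, it uses all four Iris features (rather than the single petal-length column used in this notebook) and compares three kernel choices. The `scores` dictionary is a hypothetical helper name.

```python
from sklearn import datasets, svm
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=4
)

scores = {}
for kernel in ("linear", "rbf", "poly"):
    model = svm.SVC(kernel=kernel)        # step 2: create the classifier
    model.fit(X_train, y_train)           # step 3: train it
    y_pred = model.predict(X_test)        # step 4: predict on unseen data
    scores[kernel] = accuracy_score(y_test, y_pred)

print(scores)
```

A single 30-sample test set gives only a rough comparison, so small differences between kernels should not be over-interpreted.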

In [18]:
model.fit(X_train_mod,y_train_mod)

  y = column_or_1d(y, warn=True)
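
The fragment above is the tail of scikit-learn's `DataConversionWarning`. It appears because `y_train_mod` is a column vector of shape `(n_samples, 1)`, while classifiers expect 1-D labels. Only the features need reshaping. The sketch below, which mirrors the notebook's split but leaves the labels one-dimensional, turns that warning into an error to show it is no longer raised. The name `test_accuracy` is illustrative.

```python
import warnings

from sklearn import datasets, svm
from sklearn.exceptions import DataConversionWarning
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X = iris.data[:, 2].reshape(-1, 1)   # features must be 2-D: (n_samples, 1)
y = iris.target                      # labels stay 1-D: (n_samples,)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4
)

with warnings.catch_warnings():
    # Promote the column-vector warning to an error; fit would fail if it fired
    warnings.simplefilter("error", DataConversionWarning)
    model = svm.SVC(kernel="linear").fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
print(test_accuracy)
```

An existing column vector can also be flattened with `y_train_mod.ravel()` before calling `fit`.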


In [19]:
y_pred_mod=model.predict(X_test_mod)

In [20]:
from sklearn.metrics import accuracy_score

In [21]:
print(accuracy_score(y_test_mod,y_pred_mod))

0.9666666666666667
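
The score above comes from a single 30-sample test set. `cross_val_score` was imported earlier but never used, so here is a hedged sketch of how it could give a steadier estimate for the same single-feature model. The choice of `cv=5` is an assumption and is not taken from the notebook.

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()
X = iris.data[:, 2].reshape(-1, 1)   # petal length as a single-column matrix
y = iris.target

# 5-fold cross-validation: five accuracy scores, one per held-out fold
cv_scores = cross_val_score(svm.SVC(kernel="linear"), X, y, cv=5)
print(cv_scores.mean(), cv_scores.std())
```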


# DATA MODELLING USING THE KNN MODEL

In [22]:
X=iris.data
y=iris.target

In [23]:
X.shape

(150, 4)

In [24]:
y.shape

(150,)

In [25]:
from sklearn.neighbors import KNeighborsClassifier


The k-Nearest Neighbors (KNN) algorithm is a simple yet effective machine learning algorithm used for both classification and regression tasks. Its fundamental principle is based on the idea that data points with similar features tend to belong to the same class or have similar target values. KNN makes predictions by finding the K training data points that are nearest to a new, unseen data point and then using majority voting (for classification) or averaging (for regression) to determine the prediction. Here's how KNN works step by step:

1. **Initialization**:
   
   - Choose the number of neighbors, K, which is a hyperparameter that you need to specify in advance. K represents the number of nearest neighbors to consider when making predictions.

2. **Training**:

   - The KNN algorithm stores the entire training dataset, which consists of feature vectors and their corresponding labels (for classification) or target values (for regression).

3. **Prediction**:

   - Given a new, unseen data point (a feature vector), the algorithm identifies the K training data points that are closest to the new data point based on a distance metric (commonly Euclidean distance).

4. **Majority Voting (Classification)**:

   - For classification tasks, KNN performs majority voting among the K nearest neighbors. It counts the number of neighbors belonging to each class and assigns the class with the highest count as the predicted class for the new data point.

   - The choice of class label is based on the class labels of the majority of the K neighbors. In case of a tie, the algorithm can use different strategies, such as weighting the votes based on the distance to break the tie.

5. **Averaging (Regression)**:

   - For regression tasks, KNN calculates the average (or weighted average) of the target values of the K nearest neighbors. This average becomes the predicted target value for the new data point.

   - If weighted averaging is used, closer neighbors may have a greater influence on the prediction than those farther away. Weights can be based on the distance or other factors.

6. **Output**:

   - KNN produces the final prediction for the new data point, either a class label (for classification) or a numerical value (for regression).
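
The classification steps above can be sketched from scratch with NumPy. This is a minimal illustration, not how scikit-learn implements KNN. The function `knn_predict` and the toy arrays are hypothetical, and ties are resolved simply in favor of the class seen first among the nearest neighbors.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest neighbors."""
    # Euclidean distance from x_new to every stored training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances, nearest first
    nearest = np.argsort(distances)[:k]
    # Majority vote over the neighbors' labels
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Toy data: two well-separated clusters
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([1.1, 1.0])))  # near the first cluster
print(knn_predict(X_toy, y_toy, np.array([5.1, 5.0])))  # near the second cluster
```

For regression, the vote would be replaced by the mean of the neighbors' target values.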

Key Considerations and Notes:

- The choice of distance metric (commonly Euclidean, Manhattan, or Minkowski distance) can significantly impact the results, so it's essential to choose an appropriate metric for your data.

- The value of K is a critical hyperparameter that influences the algorithm's performance. A small K (such as 1) gives a flexible decision boundary that is sensitive to noise and tends to overfit, while a large K smooths the boundary and can underfit.

- KNN is a non-parametric algorithm, meaning it does not make any assumptions about the underlying data distribution. It can work well for both linear and non-linear data.

- KNN can be sensitive to the scaling of features, so it's often a good practice to standardize or normalize your data.

- The computational complexity of KNN can be relatively high, especially for large datasets, as it requires calculating distances to all data points in the training set. This can be mitigated with efficient data structures like KD-trees or ball trees.

- KNN is a lazy learner, meaning it doesn't build an explicit model during training. Instead, it memorizes the training data and makes predictions on-the-fly during testing.

- KNN's performance can be influenced by the choice of K, the distance metric, and the handling of ties. Experimentation and cross-validation are often used to determine suitable settings for these parameters.
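
Two of the notes above, feature scaling and choosing K by cross-validation, can be combined in one sketch. The candidate values of K and the use of 5 folds are assumptions made for illustration, and `mean_accuracy` and `best_k` are hypothetical names.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

mean_accuracy = {}
for k in (1, 3, 5, 7, 9, 11):
    # The scaler sits inside the pipeline, so it is fit on each fold's training part only
    pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    mean_accuracy[k] = cross_val_score(pipeline, X, y, cv=5).mean()

best_k = max(mean_accuracy, key=mean_accuracy.get)
print(mean_accuracy)
print(best_k)
```

On a dataset this small the scores for different K are often close, so the selected value should be treated as indicative rather than definitive.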

The code in cell 26 below creates a k-nearest neighbors (KNN) classifier with `n_neighbors=1` and assigns it to the variable `knn`. The bare `knn` on the last line of that cell simply displays the classifier object. You can then use the `knn` variable to train the classifier, make predictions, and perform other operations.

Here's how you can use the `knn` classifier object to train it and make predictions:

```python
from sklearn.neighbors import KNeighborsClassifier

# Create a KNN classifier with k=1
knn = KNeighborsClassifier(n_neighbors=1)

# Assuming you have training data X_train and corresponding labels y_train
knn.fit(X_train, y_train)  # Train the KNN classifier

# Assuming you have test data X_test
y_pred = knn.predict(X_test)  # Make predictions on the test data
```

In this example:

- `knn.fit(X_train, y_train)` trains the KNN classifier on the training data (`X_train` features and `y_train` labels).

- `knn.predict(X_test)` uses the trained classifier to make predictions on new, unseen data (`X_test` features), and the results are stored in the `y_pred` variable.

You can then use `y_pred` to evaluate the model's performance or for other downstream tasks. Adjusting the value of `n_neighbors` will affect the number of neighbors considered when making predictions, and you can experiment with different values to find the best setting for your specific machine learning problem.

In [26]:
knn=KNeighborsClassifier(n_neighbors=1)
knn

In [27]:
knn.fit(X,y)

In [28]:
import numpy as np
a=np.array([4,5,6,2])
a.shape

(4,)

In [29]:
knn.predict([a])


array([2])
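
The predicted label `2` is an index into `iris.target_names`. The sketch below repeats the prediction, converts it to a species name, and uses `kneighbors` to show which training sample decided the result. The variable names `label`, `species`, `distance`, and `index` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
knn = KNeighborsClassifier(n_neighbors=1).fit(iris.data, iris.target)

a = np.array([4, 5, 6, 2])
# predict expects a 2-D array of shape (n_samples, n_features)
label = knn.predict(a.reshape(1, -1))[0]
species = iris.target_names[label]

# Distance to, and row index of, the single nearest training sample
distance, index = knn.kneighbors(a.reshape(1, -1))
print(label, species)
print(distance[0, 0], index[0, 0], iris.data[index[0, 0]])
```

Because the model was fit on all 150 samples, this prediction says nothing about accuracy on unseen data. It only shows how to query the fitted classifier.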

# LOGISTIC REGRESSION

In [30]:
from sklearn.linear_model import LogisticRegression

In [31]:
logReg=LogisticRegression()
logReg.fit(X,y)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
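
The message above is a `ConvergenceWarning`: the default `lbfgs` solver stopped at its default limit of 100 iterations. The sketch below tries the two remedies the warning itself suggests, raising `max_iter` and scaling the features. The value `max_iter=1000` is an assumption chosen to be comfortably large, and the names `more_iterations` and `scaled` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Option 1: give the solver more iterations
more_iterations = LogisticRegression(max_iter=1000).fit(X, y)

# Option 2: standardize the features before fitting
scaled = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)

# n_iter_ reports how many iterations the solver actually used
print(more_iterations.n_iter_, scaled[-1].n_iter_)
print(more_iterations.score(X, y), scaled.score(X, y))
```

Scaling is usually the more robust fix, since it also keeps the regularization penalty comparable across features.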


In [32]:
logReg.predict([a])

array([2])
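
Unlike KNN with a single neighbor, logistic regression also provides class probabilities. The sketch below shows how `predict_proba` could be used for the same sample `a`. As an assumption for illustration, `max_iter=1000` is set so the solver can converge.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
# max_iter=1000 is an assumption made here, not the notebook's setting
logReg = LogisticRegression(max_iter=1000).fit(iris.data, iris.target)

a = np.array([4, 5, 6, 2])
proba = logReg.predict_proba([a])[0]   # one probability per class

for name, p in zip(iris.target_names, proba):
    print(f"{name}: {p:.3f}")
```

The class returned by `predict` is the one with the highest probability in this list.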