
This code defines a Flask app that lets users upload CSV files. Each file is saved to the uploads directory and parsed with Python's csv module; the parsed rows are sent to Elasticsearch via Logstash, and up to 10 log messages from a fixed time range (the last seven days) are displayed in the results.html template.

The send_to_logstash() function reads the CSV file with Python's csv.reader() and builds a dictionary for each data row, using the column headers as keys. It adds the source file path and the host name to each dictionary, serializes it to JSON, and sends it to Logstash using the logstash library.
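The row-to-dictionary step can be tried in isolation, without a running Logstash instance. The following is a minimal sketch; the helper name rows_to_dicts and the sample data are illustrative assumptions and are not part of the app.

```python
import csv
import io


def rows_to_dicts(csv_text):
    """Turn CSV text into one dict per data row, keyed by the header row.

    Hypothetical helper for illustration; the app reads from a file instead.
    """
    reader = csv.reader(io.StringIO(csv_text, newline=''))
    headers = next(reader)
    return [dict(zip(headers, row)) for row in reader]


# Made-up sample data, for illustration only
sample = "timestamp,level,message\n2023-01-01T00:00:00,ERROR,disk full\n"
records = rows_to_dicts(sample)
print(records)
```

Note that zip() stops at the shorter of its inputs, so a row with more cells than the header silently loses the extra cells. csv.DictReader performs the same header-to-row mapping in one step and may be a better fit when that matters.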

The get_results() function connects to Elasticsearch and runs a range query on the @timestamp field to retrieve up to 10 log messages from the last seven days. Note that it returns the entire dictionary (the _source document) for each log message, not just the message string.
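The request body for such a time-range search can be built and inspected without connecting to Elasticsearch. This is a sketch under assumptions: the helper name build_range_query is hypothetical, and the sort clause (newest first) is one possible reading of "top" rather than something the original description specifies.

```python
def build_range_query(start_time, end_time, size=10):
    """Build a search body for documents whose @timestamp is in range.

    Hypothetical helper for illustration only.
    """
    return {
        "query": {"range": {"@timestamp": {"gte": start_time, "lte": end_time}}},
        # Assumption: "top" means most recent, so sort newest first
        "sort": [{"@timestamp": {"order": "desc"}}],
        # Ask the server for only as many hits as will be displayed
        "size": size,
    }


# "now-7d" and "now" are Elasticsearch date math expressions
body = build_range_query("now-7d", "now")
print(body)
```

Limiting size in the request avoids transferring thousands of hits only to discard all but ten on the client.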

With this code, users can upload CSV log data and inspect recent entries for potential issues or failures.

In [None]:
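from flask import Flask, render_template, request
from werkzeug.utils import secure_filename
import os
import json
import socket
import logging
import logstash
from elasticsearch import Elasticsearch
import csv

app = Flask(__name__)

UPLOAD_DIR = 'uploads'

# python-logstash handlers plug into the standard logging module, so attach
# one to a dedicated logger. Logstash needs a matching tcp input on this port.
logstash_logger = logging.getLogger('csv-upload')
logstash_logger.setLevel(logging.INFO)
if not logstash_logger.handlers:  # avoid duplicate handlers when the cell is re-run
    logstash_logger.addHandler(logstash.TCPLogstashHandler('localhost', 5044, version=1))

@app.route('/')
def index():
    return render_template('index.html')

@app.route('/upload', methods=['POST'])
def upload():
    # get uploaded file
    file = request.files['file']

    # save file to disk under a sanitized name, so a crafted filename
    # cannot write outside the uploads directory
    os.makedirs(UPLOAD_DIR, exist_ok=True)
    file_path = os.path.join(UPLOAD_DIR, secure_filename(file.filename))
    file.save(file_path)

    # parse file and send data to Elasticsearch via Logstash
    send_to_logstash(file_path)

    # display results in web UI
    results = get_results()
    return render_template('results.html', results=results)

def send_to_logstash(file_path):
    # read in CSV file; newline='' lets the csv module handle line endings
    with open(file_path, 'r', newline='') as f:
        csv_reader = csv.reader(f)
        headers = next(csv_reader)
        for row in csv_reader:
            # build one dictionary per row, keyed by the column headers
            log_dict = dict(zip(headers, row))
            log_dict['source'] = file_path
            log_dict['host'] = socket.gethostname()
            # the handler's emit() expects a LogRecord, not a string,
            # so send the JSON payload through the logger instead
            logstash_logger.info(json.dumps(log_dict))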

def get_results():
    # connect to Elasticsearch (assumes a local node on the default port)
    es = Elasticsearch('http://localhost:9200')

    # specify index name and time range
    index_name = 'logstash-*'
    start_time = 'now-7d'
    end_time = 'now'

    # define search query: newest entries first, fetching only the 10 we display
    query = {
        "query": {"range": {"@timestamp": {"gte": start_time, "lte": end_time}}},
        "sort": [{"@timestamp": {"order": "desc"}}],
        "size": 10
    }

    # execute search query and get results
    res = es.search(index=index_name, body=query)

    # return the full _source document of each hit
    return [hit['_source'] for hit in res['hits']['hits']]

if __name__ == '__main__':
    app.run(debug=True)


**Machine learning-based predictive analytics tool**: We first read in the data from a CSV file using pandas, then split it into training and test sets using the train_test_split() function from scikit-learn. Next, we select a support vector machine (SVM) as the predictive model and use scikit-learn's Pipeline and GridSearchCV classes to tune its hyperparameters (C and the kernel).

We evaluate the model's performance using cross-validation during the grid search and the classification_report() function on the held-out test set. The final step, deploying the trained model in a production environment and monitoring its performance over time, is left as a placeholder in the code.
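Cross-validation of a scaler-plus-SVM pipeline can also be run on its own, outside a grid search. The sketch below is illustrative only: it uses synthetic data from make_classification as a stand-in for the real data set, and the values C=1 and kernel='rbf' are arbitrary choices, not tuned results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real data set (assumption for illustration)
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

# Keeping the scaler inside the pipeline means it is refit on each
# training fold, so no statistics leak from the held-out fold
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC(C=1, kernel='rbf')),
])

# One accuracy score per fold (accuracy is the default for classifiers)
scores = cross_val_score(pipeline, X, y, cv=5)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```

The spread between fold scores gives a rough sense of how stable the estimate is; a single train/test split cannot show that.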

The code below is an example of how to build a machine learning-based predictive analytics tool. Such a tool can help organizations better understand and manage their data and make more informed decisions about potential failures or issues.

In [None]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report
import pandas as pd

# Step 1: Data Collection and Preparation
data = pd.read_csv('data.csv')
X = data.drop('label', axis=1)
y = data['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 2: Feature Engineering
# Perform feature engineering as necessary

# Step 3: Model Selection
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC())
])
param_grid = {
    'svm__C': [0.1, 1, 10],
    'svm__kernel': ['linear', 'rbf', 'poly']
}
grid_search = GridSearchCV(pipeline, param_grid=param_grid, cv=5)
grid_search.fit(X_train, y_train)
print("Best parameters: ", grid_search.best_params_)

# Step 4: Model Tuning
# The grid search in Step 3 already tunes C and the kernel;
# extend param_grid (e.g. with svm__gamma) as necessary

# Step 5: Model Training and Evaluation
# GridSearchCV refits the best pipeline (scaler + SVM) on the full training set,
# so reuse it rather than training a new SVC without the scaler
clf = grid_search.best_estimator_
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))

# Step 6: Model Deployment and Monitoring
# Deploy the trained model in a production environment and monitor its performance over time
