# Cricket Analysis and Prediction using ML.NET

<img src="./assets/cricket-banner.jpg" alt="Cricket" style="zoom:80%;margin:auto;">

<br>
<span style="color:red"><b>Disclaimer:</b> The analysis and predictions here are for learning purposes only and should not be used for any illegal activities such as betting.</span>

## Introduction

Cricket, a bat-and-ball game, is one of the most popular sports and is played in several formats. It is a game of numbers: each match generates a wealth of data about players and matches. Analysts and data scientists use this data to uncover insights and to forecast match and player performance. In this session, I'll perform some analytics and prediction on cricket data using Microsoft's ML.NET framework and C#.

<img src="./assets/cricket-history.png" alt="Cricket" style="zoom:80%;margin:auto;">

## Problem Statement

Analyze the cricket dataset and predict the score after 6 overs for a match.

## Dataset

- Source: [CricSheet](https://cricsheet.org/) > [Download T20 Dataset](https://cricsheet.org/downloads/t20s_male_csv2.zip)


### Exploratory Data Analysis - EDA

- Number of features: 22
- Total records: 194K
- Features are a mix of date, numeric, and string types
- Contains null values

In [2]:
#!about

- .NET Interactive
- © 2020 Microsoft Corporation
- Version: 1.0.230701+897ec27256aa312cc872b52b261726684b29d42b
- Build date: 2021-06-09T11:13:17.2992510Z
- https://github.com/dotnet/interactive


### 1. Define Application wide Items

#### Nuget Packages

In [3]:
// Install the Microsoft.Data.Analysis NuGet package (provides DataFrame)
#r "nuget:Microsoft.Data.Analysis"

Installed package Microsoft.Data.Analysis version 0.4.0

#### Namespaces

In [4]:
using Microsoft.Data.Analysis;
using Microsoft.AspNetCore.Html;
using Microsoft.DotNet.Interactive.Formatting;
using static Microsoft.DotNet.Interactive.Formatting.PocketViewTags;
using System.Net.Http;
using System.IO;
using System.IO.Compression;

#### Constants

In [5]:
// File
const string DATASET_DIRECTORY = "t20s_male_csv2";
const string DATASET_FILE = "t20s_male_csv2.zip";
const string DATASET_URL = "https://cricsheet.org/downloads/";
const string FINAL_CSV = "t20_final.csv";

// DataFrame/Table
const int TOP_COUNT = 5;
const int DEFAULT_ROW_COUNT = 10;

### Utility Functions

#### Merges multiple CSV files present in specified directory

In [6]:
public void MergeCsv(string sourceFolder, string destinationFile)
{
    /*
     https://chris.koester.io/index.php/2017/01/27/combine-csv-files/
     C# script combines multiple csv files without duplicating headers.
     Combining 8 files with a combined total of about 9.2 million rows 
     took about 3.5 minutes on a network share and 44 seconds on an SSD.
    */

    // Specify wildcard search to match CSV files that will be combined
    string[] filePaths = Directory.GetFiles(sourceFolder, "*.csv");

    // The using declaration closes the writer even if reading a file throws
    using var fileDest = new StreamWriter(destinationFile, true);

    for (int i = 0; i < filePaths.Length; i++)
    {
        string file = filePaths[i];

        string[] lines = File.ReadAllLines(file);

        if (i > 0)
        {
            lines = lines.Skip(1).ToArray(); // Skip header row for all but first file
        }

        foreach (string line in lines)
        {
            fileDest.WriteLine(line);
        }
    }
}
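
To see the header handling in isolation, here is a minimal sketch that runs a compact stand-in for the merge on two tiny made-up files in a temporary folder. `MergeCsvDemo`, the file names, and the sample rows are hypothetical and exist only for this illustration. It also streams lines with `File.ReadLines` instead of loading each file fully.

```csharp
using System;
using System.IO;
using System.Linq;

// Hypothetical stand-in for MergeCsv: keeps the header of the first file only.
void MergeCsvDemo(string sourceFolder, string destinationFile)
{
    // Sort the paths so the merge order is deterministic.
    var filePaths = Directory.GetFiles(sourceFolder, "*.csv")
        .OrderBy(p => p, StringComparer.Ordinal)
        .ToArray();

    using var writer = new StreamWriter(destinationFile, append: false);
    for (var i = 0; i < filePaths.Length; i++)
    {
        // File.ReadLines yields one line at a time; skip the header after the first file.
        foreach (var line in File.ReadLines(filePaths[i]).Skip(i > 0 ? 1 : 0))
        {
            writer.WriteLine(line);
        }
    }
}

// Two made-up CSV files with identical headers.
var tempRoot = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
var sourceDir = Path.Combine(tempRoot, "source");
Directory.CreateDirectory(sourceDir);
File.WriteAllLines(Path.Combine(sourceDir, "a.csv"), new[] { "match_id,ball", "1,0.1" });
File.WriteAllLines(Path.Combine(sourceDir, "b.csv"), new[] { "match_id,ball", "2,0.1" });

// The destination lives outside the source folder, so it is not picked up by the wildcard.
var mergedPath = Path.Combine(tempRoot, "merged.csv");
MergeCsvDemo(sourceDir, mergedPath);

var mergedLines = File.ReadAllLines(mergedPath);
Console.WriteLine(string.Join(" | ", mergedLines)); // match_id,ball | 1,0.1 | 2,0.1

Directory.Delete(tempRoot, recursive: true);
```

The header appears once, followed by the data rows of both files.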

#### Formatters

By default, a DataFrame does not render as a table in the notebook. To display it as one, we register a custom HTML formatter, as shown in the next cell.

In [7]:
// Formats the table
Formatter.Register(typeof(DataFrame),(dataFrame, writer) =>
{
    var df = dataFrame as DataFrame;
    var headers = new List<IHtmlContent>();
    headers.Add(th(i("index")));
    headers.AddRange(df.Columns.Select(c => (IHtmlContent)th(c.Name)));
    var rows = new List<List<IHtmlContent>>();
    var take = DEFAULT_ROW_COUNT; // Maximum number of rows to render
    for (var i = 0; i < Math.Min(take, df.Rows.Count); i++)
    {
        var cells = new List<IHtmlContent>();
        cells.Add(td(i));
        foreach (var obj in df.Rows[i])
        {
            cells.Add(td(obj));
        }
        rows.Add(cells);
    }

    var t = table(
        thead(
            headers),
        tbody(
            rows.Select(
                r => tr(r))));

    writer.Write(t);
}, "text/html");

### 2. Load Dataset

The dataset on Cricsheet is distributed as a compressed zip archive. It contains the CSV files that we will use for our analysis and prediction.

#### 2.a Cleans Previous Data

In [8]:
/// <summary>
/// Cleans previous data present in current working directory
/// </summary>
public void CleanPreviousData()
{
    if (File.Exists(DATASET_FILE))
    {
        File.Delete(DATASET_FILE);
    }

    if (File.Exists(FINAL_CSV))
    {
        File.Delete(FINAL_CSV);
    }

    if (Directory.Exists(DATASET_DIRECTORY))
    {
        Directory.Delete(DATASET_DIRECTORY, true);
    }
}

#### 2.b Loads the Dataset and returns a DataFrame

In [9]:
/// <summary>
/// Loads the dataset from specified URL
/// </summary>
/// <param name="url">Remote URL</param>
/// <param name="fileName">Name of the file</param>
/// <returns>A DataFrame</returns>
public async Task<DataFrame> LoadDatasetAsync(string url, string fileName)
{
    // Delete previous data
    CleanPreviousData();

    // Loads zip file from remote URL; Uri resolves the file name against the base URL
    var remoteFileUri = new Uri(new Uri(url), fileName);
    using (var httpClient = new HttpClient())
    {
        var contents = await httpClient.GetByteArrayAsync(remoteFileUri);
        await File.WriteAllBytesAsync(fileName, contents);
    }

    // Unzip file -> Merge CSV -> Load to DataFrame
    if (File.Exists(fileName))
    {
        var extractedDirectory = Path.Combine(Directory.GetCurrentDirectory(), DATASET_DIRECTORY);

        try
        {
            ZipFile.ExtractToDirectory(fileName, extractedDirectory);
            MergeCsv(extractedDirectory, FINAL_CSV);

            return DataFrame.LoadCsv(FINAL_CSV);
        }
        catch (Exception e)
        {
            display(e);
            throw;
        }

    }

    return new DataFrame();
}
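
The download address is a base URL plus a file name. As a small sketch of how `System.Uri` resolves such a pair (following URI rules rather than file-system path rules), consider the snippet below. The variable names are illustrative. The second case shows why the trailing slash on the base URL matters.

```csharp
using System;

// Base URL with a trailing slash: the file name is appended to the last path segment.
var withSlash = new Uri(new Uri("https://cricsheet.org/downloads/"), "t20s_male_csv2.zip");
Console.WriteLine(withSlash); // https://cricsheet.org/downloads/t20s_male_csv2.zip

// Base URL without a trailing slash: the last segment ("downloads") is replaced.
var withoutSlash = new Uri(new Uri("https://cricsheet.org/downloads"), "t20s_male_csv2.zip");
Console.WriteLine(withoutSlash); // https://cricsheet.org/t20s_male_csv2.zip
```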

In [10]:
var cricketDataFrame = await LoadDatasetAsync(DATASET_URL, DATASET_FILE);

#### Filtering

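As we are only considering the first 6 overs of each innings, we need to remove the data for later overs. The `ball` column uses over.delivery notation with zero-based overs (`0.1` is the first delivery of the first over), so the first six overs are the rows whose `ball` value is at most 6.0.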

In [13]:
var filterColumn = cricketDataFrame.Columns["ball"].ElementwiseLessThanOrEqual(6.0);
var sixOverDataFrame = cricketDataFrame.Filter(filterColumn);
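
To make the effect of this threshold concrete, the following sketch applies the same `<= 6.0` predicate to a handful of made-up `ball` values using plain LINQ. It assumes the integer part of `ball` is the zero-based over number. Values such as `5.7` can occur when an over contains extra deliveries.

```csharp
using System;
using System.Linq;

// Made-up sample of `ball` values in over.delivery notation.
float[] balls = { 0.1f, 0.6f, 1.1f, 5.6f, 5.7f, 6.1f, 19.6f };

// Same predicate as the DataFrame filter: keep values <= 6.0.
var firstSixOvers = balls.Where(b => b <= 6.0).ToArray();

// Assumption: the integer part is the zero-based over number.
var overNumbers = firstSixOvers
    .Select(b => (int)Math.Floor(b))
    .Distinct()
    .ToArray();

Console.WriteLine(firstSixOvers.Length);            // 5
Console.WriteLine(string.Join(", ", overNumbers));  // 0, 1, 5
```

Deliveries from the seventh over onwards (`6.1`, `19.6`) are dropped, while everything in overs 0 to 5 is kept.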

### 3. Data Analysis

In [14]:
sixOverDataFrame.Head(TOP_COUNT)

| index | match_id | season | start_date | venue | innings | ball | batting_team | bowling_team | striker | non_striker | bowler | runs_off_bat | extras | wides | noballs | byes | legbyes | penalty | wicket_type | player_dismissed | other_wicket_type | other_player_dismissed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1001349 | 2016/17 | 2017-02-17 00:00:00Z | Melbourne Cricket Ground | 1 | 0.1 | Australia | Sri Lanka | AJ Finch | M Klinger | SL Malinga | 0 | 0 |  |  |  |  |  |  |  |  |  |
| 1 | 1001349 | 2016/17 | 2017-02-17 00:00:00Z | Melbourne Cricket Ground | 1 | 0.2 | Australia | Sri Lanka | AJ Finch | M Klinger | SL Malinga | 0 | 0 |  |  |  |  |  |  |  |  |  |
| 2 | 1001349 | 2016/17 | 2017-02-17 00:00:00Z | Melbourne Cricket Ground | 1 | 0.3 | Australia | Sri Lanka | AJ Finch | M Klinger | SL Malinga | 1 | 0 |  |  |  |  |  |  |  |  |  |
| 3 | 1001349 | 2016/17 | 2017-02-17 00:00:00Z | Melbourne Cricket Ground | 1 | 0.4 | Australia | Sri Lanka | M Klinger | AJ Finch | SL Malinga | 2 | 0 |  |  |  |  |  |  |  |  |  |
| 4 | 1001349 | 2016/17 | 2017-02-17 00:00:00Z | Melbourne Cricket Ground | 1 | 0.5 | Australia | Sri Lanka | M Klinger | AJ Finch | SL Malinga | 0 | 0 |  |  |  |  |  |  |  |  |  |


In [15]:
sixOverDataFrame.Info()

| index | Info | match_id | season | start_date | venue | innings | ball | batting_team | bowling_team | striker | non_striker | bowler | runs_off_bat | extras | wides | noballs | byes | legbyes | penalty | wicket_type | player_dismissed | other_wicket_type | other_player_dismissed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | DataType | System.Single | System.String | System.DateTime | System.String | System.Single | System.Single | System.String | System.String | System.String | System.String | System.String | System.Single | System.Single | System.String | System.String | System.String | System.String | System.String | System.String | System.String | System.String | System.String |
| 1 | Length (excluding null values) | 75244 | 75244 | 75244 | 75244 | 75244 | 75244 | 75244 | 75244 | 75244 | 75244 | 75244 | 75244 | 75244 | 75244 | 75244 | 75244 | 75244 | 75244 | 75244 | 75244 | 75244 | 75244 |


In [16]:
sixOverDataFrame.Description()

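| index | Description | match_id | start_date | innings | ball | runs_off_bat | extras |
|---|---|---|---|---|---|---|---|
| 0 | Length (excluding null values) | 75244 | 75244 | 75244.0 | 75244.0 | 75244.0 | 75244.0 |
| 1 | Max | 1263713 | &lt;null&gt; | 4.0 | 5.9 | 6.0 | 5.0 |
| 2 | Min | 211028 | &lt;null&gt; | 1.0 | 0.1 | 0.0 | 0.0 |
| 3 | Mean | 879266 | &lt;null&gt; | 1.5008107 | 2.8433447 | 1.1237973 | 0.08085695 |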
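
The problem statement targets the score after 6 overs, which is not a column in the dataset and has to be derived from the ball-by-ball rows. One possible way to do that, sketched here with plain LINQ on made-up rows rather than on the DataFrame, is to sum `runs_off_bat + extras` per `match_id` and `innings`. This assumes `extras` already totals wides, no-balls, byes, leg byes, and penalty runs. The tuple field names are illustrative.

```csharp
using System;
using System.Linq;

// Made-up deliveries: (match_id, innings, ball, runs_off_bat, extras).
var deliveries = new (int MatchId, int Innings, float Ball, int RunsOffBat, int Extras)[]
{
    (1, 1, 0.1f, 4, 0),
    (1, 1, 0.2f, 0, 1),
    (1, 1, 5.6f, 6, 0),
    (1, 1, 6.1f, 2, 0), // seventh over, excluded by the filter
    (1, 2, 0.1f, 1, 0),
    (1, 2, 3.4f, 0, 2),
};

// Keep the first six overs, then total the runs per match and innings.
var sixOverTotals = deliveries
    .Where(d => d.Ball <= 6.0)
    .GroupBy(d => (d.MatchId, d.Innings))
    .Select(g => (g.Key.MatchId, g.Key.Innings, Total: g.Sum(d => d.RunsOffBat + d.Extras)))
    .OrderBy(t => t.MatchId)
    .ThenBy(t => t.Innings)
    .ToArray();

foreach (var t in sixOverTotals)
{
    Console.WriteLine($"match {t.MatchId}, innings {t.Innings}: {t.Total}");
}
// match 1, innings 1: 11
// match 1, innings 2: 3
```

Each resulting row is one candidate training example: an innings together with its six-over total.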


## Problem statement
- EDA
  - Data Cleaning
  - Analysis of Dataset
  - Visualizations
- Prediction
  - Algorithm (Linear Regression)
  - Parameter Tuning
  - Better algorithm
  - Accuracy/Loss
  - Model Builder
 