# Cricket Analysis and Prediction using ML.NET

<img src=".\assets\cricket-banner.jpg" alt="Cricket" style="zoom:80%;margin:auto;">

<br>
<span style="color:red"><b>Disclaimer:</b> The analysis and predictions presented here are for learning purposes only and should not be used for any illegal activity such as betting.</span>

## Introduction

Cricket, a bat-and-ball game, is one of the most popular sports and is played in several formats. It is a game of numbers: every match generates a wealth of data about players and matches. Analysts and data scientists use this data to uncover insights and to forecast match outcomes and player performance. In this session, I'll perform some analysis and prediction on cricket data using Microsoft's ML.NET framework and C#.

<img src="assets\cricket-history.png" alt="Cricket" style="zoom:80%;margin:auto;">

## Problem Statement

Analyze the cricket dataset and predict the score after 6 overs for a match.

## Dataset

- Source: [CricSheet](https://cricsheet.org/) > [Download T20 Dataset](https://cricsheet.org/downloads/t20s_male_csv2.zip)


### Exploratory Data Analysis - EDA

- No of Features: 22
- Total Records: 194K
- Features have a mix of date, number and string
- Contains Null Values

In [3]:
#!about

- .NET Interactive
- © 2020 Microsoft Corporation
- Version: 1.0.230701+897ec27256aa312cc872b52b261726684b29d42b
- Build date: 2021-06-09T11:13:17.2992510Z
- https://github.com/dotnet/interactive


### 1. Define Application-wide Items

#### Nuget Packages

In [137]:
// ML.NET Nuget packages installation
#r "nuget:Microsoft.ML"
#r "nuget:Microsoft.ML.FastTree"    
#r "nuget:Microsoft.Data.Analysis"
#r "nuget:Daany.DataFrame"
#r "nuget:CsvHelper"

Installed package Microsoft.ML.FastTree version 1.5.5

#### Namespaces

In [128]:
using CsvHelper;
using CsvHelper.Configuration;
using Daany;
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.Data.Analysis;
using Microsoft.AspNetCore.Html;
using Microsoft.DotNet.Interactive.Formatting;
using static Microsoft.DotNet.Interactive.Formatting.PocketViewTags;
using System.Net.Http;
using System.IO;
using System.IO.Compression;
using System.Globalization;

#### Constants

In [141]:
// File
const string DATASET_DIRECTORY = "t20s_male_csv2";
const string DATASET_FILE = "t20s_male_csv2.zip";
const string DATASET_URL = "https://cricsheet.org/downloads/";
const string DATASET_MERGED_CSV = "t20_merged.csv";
const string DATASET_ALL_CSV = "t20_all.csv";
const string DATASET_CLEANED_CSV = "t20_cleaned.csv";
const string MODEL_FILE = "model.zip";

// DataFrame/Table
const int TOP_COUNT = 5;
const int DEFAULT_ROW_COUNT = 10;

// Columns
const string CSV_COLUMN_BALL = "ball";
const string CSV_COLUMN_SCORE = "score";
const string CSV_COLUMN_RUNS_OFF_BAT = "runs_off_bat";
const string CSV_COLUMN_EXTRAS = "extras";
const string CSV_COLUMN_TOTAL_SCORE = "total_score";
const string CSV_COLUMN_VENUE = "venue";
const string CSV_COLUMN_INNINGS = "innings";
const string CSV_COLUMN_BATTING_TEAM = "batting_team";
const string CSV_COLUMN_BOWLING_TEAM = "bowling_team";
const string CSV_COLUMN_STRIKER = "striker";
const string CSV_COLUMN_NON_STRIKER = "non_striker";
const string CSV_COLUMN_BOWLER = "bowler";

// Misc
const float OVER_THRESHOLD = 6.0f;

### 2. Utility Functions

#### Merges multiple CSV files present in specified directory

In [92]:
public void MergeCsv(string sourceFolder, string destinationFile)
{
    /*
     https://chris.koester.io/index.php/2017/01/27/combine-csv-files/
     C# script combines multiple csv files without duplicating headers.
     Combining 8 files with a combined total of about 9.2 million rows 
     took about 3.5 minutes on a network share and 44 seconds on an SSD.
    */

    // Specify wildcard search to match CSV files that will be combined
    string[] filePaths = Directory.GetFiles(sourceFolder, "*.csv");

    // 'using' closes the writer even if reading one of the files throws
    using var fileDest = new StreamWriter(destinationFile, append: true);

    for (int i = 0; i < filePaths.Length; i++)
    {
        // Stream the lines instead of loading the whole file into memory;
        // skip the header row for all but the first file
        var lines = File.ReadLines(filePaths[i]).Skip(i > 0 ? 1 : 0);

        foreach (string line in lines)
        {
            fileDest.WriteLine(line);
        }
    }
}
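
As a complement to `MergeCsv`, the sketch below shows one possible way to decide which files are safe to merge: it skips files whose name ends with a given suffix and keeps only the files that share the header of the first file. `SelectMergeableCsvFiles`, the `_info.csv` suffix and the demo file contents are illustrative assumptions and not part of the original notebook; whether the downloaded archive actually contains such extra files should be checked against the extracted folder.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Returns the CSV files in 'folder' (in name order) whose first line equals the
// header of the first file. 'excludedSuffix' is a hypothetical filter such as "_info.csv".
static List<string> SelectMergeableCsvFiles(string folder, string excludedSuffix)
{
    var candidates = Directory.GetFiles(folder, "*.csv")
        .Where(p => !p.EndsWith(excludedSuffix, StringComparison.OrdinalIgnoreCase))
        .OrderBy(p => p, StringComparer.Ordinal)
        .ToList();

    if (candidates.Count == 0)
    {
        return candidates;
    }

    // The header of the first file is used as the reference layout
    string expectedHeader = File.ReadLines(candidates[0]).FirstOrDefault();

    return candidates
        .Where(p => File.ReadLines(p).FirstOrDefault() == expectedHeader)
        .ToList();
}

// Demo on a throwaway directory with made-up file names and contents
string demoFolder = Path.Combine(Path.GetTempPath(), "merge_demo_" + Guid.NewGuid().ToString("N"));
Directory.CreateDirectory(demoFolder);
File.WriteAllLines(Path.Combine(demoFolder, "1001.csv"), new[] { "match_id,ball,runs_off_bat", "1001,0.1,4" });
File.WriteAllLines(Path.Combine(demoFolder, "1002.csv"), new[] { "match_id,ball,runs_off_bat", "1002,0.1,0" });
File.WriteAllLines(Path.Combine(demoFolder, "1001_info.csv"), new[] { "version,2.1.0" });

List<string> mergeable = SelectMergeableCsvFiles(demoFolder, "_info.csv");
foreach (string file in mergeable)
{
    Console.WriteLine(Path.GetFileName(file));
}

// Remove the demo directory again
Directory.Delete(demoFolder, recursive: true);
```

Sorting the file list also makes the merge order reproducible, since the order returned by `Directory.GetFiles` is not guaranteed.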

#### Formatters

By default, a DataFrame is not rendered as a readable table in the notebook. To display it as an HTML table, we register a custom formatter, as shown in the next cell.

In [93]:
// Formats the table
Formatter.Register(typeof(Microsoft.Data.Analysis.DataFrame),(dataFrame, writer) =>
{
    var df = dataFrame as Microsoft.Data.Analysis.DataFrame;
    var headers = new List<IHtmlContent>();
    headers.Add(th(i("index")));
    headers.AddRange(df.Columns.Select(c => (IHtmlContent)th(c.Name)));
    var rows = new List<List<IHtmlContent>>();
    var take = 10;
    for (var i = 0; i < Math.Min(take, df.Rows.Count); i++)
    {
        var cells = new List<IHtmlContent>();
        cells.Add(td(i));
        foreach (var obj in df.Rows[i])
        {
            cells.Add(td(obj));
        }
        rows.Add(cells);
    }

    var t = table(
        thead(
            headers),
        tbody(
            rows.Select(
                r => tr(r))));

    writer.Write(t);
}, "text/html");

In [153]:
// Formats the Daany DataFrame
/* Formatter.Register(typeof(Daany.DataFrame), (dataFrame, writer) =>
{
    var df = dataFrame as Daany.DataFrame;
    var headers = new List<IHtmlContent>();
    headers.Add(th(i("index")));
    headers.AddRange(df.Columns.Select(c => (IHtmlContent)th(c)));
    var rows = new List<List<IHtmlContent>>();
    var take = 10;
    for (var i = 0; i < Math.Min(take, df.RowCount()); i++)
    {
        var cells = new List<IHtmlContent>();
        cells.Add(td(i));
        foreach (var obj in df.GetRowEnumerator())
        {
            cells.Add(td(obj));
        }
        rows.Add(cells);
    }

    var t = table(
        thead(
            headers),
        tbody(
            rows.Select(
                r => tr(r))));

    writer.Write(t);
}, "text/html"); */

### 2. Load Dataset

The dataset on Cricsheet is distributed as a compressed zip archive. It contains the CSV files that we will use for our analysis and prediction.

#### 2.a Cleans Previous Data

In [94]:
/// <summary>
/// Cleans previous data present in current working directory
/// </summary>
public void CleanPreviousData()
{
    if (File.Exists(DATASET_FILE))
    {
        File.Delete(DATASET_FILE);
    }

    if (File.Exists(DATASET_MERGED_CSV))
    {
        File.Delete(DATASET_MERGED_CSV);
    }
    
    if (File.Exists(DATASET_CLEANED_CSV))
    {
        File.Delete(DATASET_CLEANED_CSV);
    }
    
    if (Directory.Exists(DATASET_DIRECTORY))
    {
        Directory.Delete(DATASET_DIRECTORY, true);
    }
}

#### 2.b Loads the Dataset and returns a DataFrame

In [95]:
/// <summary>
/// Loads the dataset from specified URL
/// </summary>
/// <param name="url">Remote URL</param>
/// <param name="fileName">Name of the file</param>
/// <returns>A DataFrame</returns>
public async Task<Microsoft.Data.Analysis.DataFrame> LoadDatasetAsync(string url, string fileName)
{
    // Delete previous data
    CleanPreviousData();

    // Loads zip file from remote URL
    // Build the address with Uri; Path.Combine is meant for file-system paths
    var remoteFilePath = new Uri(new Uri(url), fileName);
    using (var httpClient = new HttpClient())
    {
        var contents = await httpClient.GetByteArrayAsync(remoteFilePath);
        await File.WriteAllBytesAsync(fileName, contents);
    }

    // Unzip file -> Merge CSV -> Load to DataFrame
    if (File.Exists(fileName))
    {
        var extractedDirectory = Path.Combine(Directory.GetCurrentDirectory(), DATASET_DIRECTORY);

        try
        {
            ZipFile.ExtractToDirectory(fileName, extractedDirectory);
            MergeCsv(extractedDirectory, DATASET_MERGED_CSV);

            return Microsoft.Data.Analysis.DataFrame.LoadCsv(DATASET_MERGED_CSV);
        }
        catch (Exception e)
        {
            display(e);
            throw;
        }

    }

    return new Microsoft.Data.Analysis.DataFrame();
}

In [96]:
var cricketDataFrame = await LoadDatasetAsync(DATASET_URL, DATASET_FILE);
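
When a base URL and a file name are joined, a trailing slash on the base decides whether the last path segment is kept or replaced. The following is a minimal sketch using `System.Uri`; `BuildDownloadUri` is a hypothetical helper introduced only for illustration and is not part of the notebook.

```csharp
using System;

// Joins a base address and a file name using standard relative-URI resolution.
// Hypothetical helper, shown only to illustrate the trailing-slash behaviour.
static Uri BuildDownloadUri(string baseUrl, string fileName)
{
    // Without a trailing '/', the last path segment of the base would be replaced
    string normalizedBase = baseUrl.EndsWith("/", StringComparison.Ordinal) ? baseUrl : baseUrl + "/";
    return new Uri(new Uri(normalizedBase), fileName);
}

Uri withSlash = BuildDownloadUri("https://cricsheet.org/downloads/", "t20s_male_csv2.zip");
Uri withoutSlash = BuildDownloadUri("https://cricsheet.org/downloads", "t20s_male_csv2.zip");

// Both calls resolve to the same address
Console.WriteLine(withSlash.AbsoluteUri);
Console.WriteLine(withoutSlash.AbsoluteUri);
```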

#### Filtering

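Since we only consider the first 6 overs of each innings, we remove the rows for all later overs. The `ball` column encodes a delivery as `over.delivery` with zero-based overs, so keeping values up to 6.0 retains overs 0 to 5.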

In [97]:
var filterColumn = cricketDataFrame.Columns[CSV_COLUMN_BALL].ElementwiseLessThanOrEqual(OVER_THRESHOLD);
var sixOverDataFrame = cricketDataFrame.Filter(filterColumn);
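
To make the filter above easier to reason about, here is a small sketch of how the `ball` values seen in this dataset (for example 0.1, 5.6, 5.9) can be read: the part before the dot is the zero-based over and the part after it is the delivery within that over. `ParseBall` and `IsWithinFirstOvers` are hypothetical helpers, not part of the notebook, and the sketch works on the text form of the value to stay clear of floating-point rounding.

```csharp
using System;
using System.Globalization;

// Splits "over.delivery" notation into its two parts (over is zero-based).
static (int Over, int Delivery) ParseBall(string ball)
{
    string[] parts = ball.Split('.');
    int over = int.Parse(parts[0], CultureInfo.InvariantCulture);
    int delivery = int.Parse(parts[1], CultureInfo.InvariantCulture);
    return (over, delivery);
}

// True when the delivery belongs to the first 'overLimit' overs.
static bool IsWithinFirstOvers(string ball, int overLimit)
{
    return ParseBall(ball).Over < overLimit;
}

foreach (string notation in new[] { "0.1", "5.6", "5.9", "6.1", "19.6" })
{
    var (overIndex, deliveryNumber) = ParseBall(notation);
    bool kept = IsWithinFirstOvers(notation, 6);
    Console.WriteLine($"{notation}: over {overIndex + 1}, delivery {deliveryNumber}, kept = {kept}");
}
```

A delivery number above 6, such as 5.9, is consistent with extra deliveries in an over (wides or no-balls), which is why the maximum `ball` value after filtering can exceed 5.6.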

### 3. Data Analysis

In [98]:
sixOverDataFrame.Head(TOP_COUNT)

index,match_id,season,start_date,venue,innings,ball,batting_team,bowling_team,striker,non_striker,bowler,runs_off_bat,extras,wides,noballs,byes,legbyes,penalty,wicket_type,player_dismissed,other_wicket_type,other_player_dismissed
0,1001349,2016/17,2017-02-17 00:00:00Z,Melbourne Cricket Ground,1,0.1,Australia,Sri Lanka,AJ Finch,M Klinger,SL Malinga,0,0,,,,,,,,,
1,1001349,2016/17,2017-02-17 00:00:00Z,Melbourne Cricket Ground,1,0.2,Australia,Sri Lanka,AJ Finch,M Klinger,SL Malinga,0,0,,,,,,,,,
2,1001349,2016/17,2017-02-17 00:00:00Z,Melbourne Cricket Ground,1,0.3,Australia,Sri Lanka,AJ Finch,M Klinger,SL Malinga,1,0,,,,,,,,,
3,1001349,2016/17,2017-02-17 00:00:00Z,Melbourne Cricket Ground,1,0.4,Australia,Sri Lanka,M Klinger,AJ Finch,SL Malinga,2,0,,,,,,,,,
4,1001349,2016/17,2017-02-17 00:00:00Z,Melbourne Cricket Ground,1,0.5,Australia,Sri Lanka,M Klinger,AJ Finch,SL Malinga,0,0,,,,,,,,,


In [99]:
sixOverDataFrame.Info()

index,Info,match_id,season,start_date,venue,innings,ball,batting_team,bowling_team,striker,non_striker,bowler,runs_off_bat,extras,wides,noballs,byes,legbyes,penalty,wicket_type,player_dismissed,other_wicket_type,other_player_dismissed
0,DataType,System.Single,System.String,System.DateTime,System.String,System.Single,System.Single,System.String,System.String,System.String,System.String,System.String,System.Single,System.Single,System.String,System.String,System.String,System.String,System.String,System.String,System.String,System.String,System.String
1,Length (excluding null values),75244,75244,75244,75244,75244,75244,75244,75244,75244,75244,75244,75244,75244,75244,75244,75244,75244,75244,75244,75244,75244,75244


In [100]:
sixOverDataFrame.Description()

index,Description,match_id,start_date,innings,ball,runs_off_bat,extras
0,Length (excluding null values),75244,75244,75244.0,75244.0,75244.0,75244.0
1,Max,1263713,<null>,4.0,5.9,6.0,5.0
2,Min,211028,<null>,1.0,0.1,0.0,0.0
3,Mean,879266,<null>,1.5008107,2.8433447,1.1237973,0.08085695


### Score and Total Score

The dataset records the runs scored on every ball in several columns, such as runs_off_bat, extras, wides and legbyes, but it has no single column for the total runs scored on a ball.
We can derive it by adding 'runs_off_bat' and 'extras'.

**Score = runs_off_bat + extras**

Similarly, we can get the total score up to each ball by taking the cumulative sum of the per-ball score within an innings.

**Total Score = CumulativeSum(Score) per innings**
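
The two formulas can be sketched independently of any DataFrame library. The example below is a minimal illustration with plain tuples; `RunningTotals` is a hypothetical helper, the first six rows mirror the opening over shown in the tables of this notebook, and the two second-innings rows are made up to show the reset.

```csharp
using System;
using System.Collections.Generic;

// Running total of (runsOffBat + extras) within each (matchId, innings) pair.
// Rows are assumed to be ordered ball by ball, as in the source CSV.
static List<int> RunningTotals(IEnumerable<(int MatchId, int Innings, int RunsOffBat, int Extras)> balls)
{
    var totals = new List<int>();
    int currentMatch = -1;
    int currentInnings = -1;
    int runningTotal = 0;

    foreach (var ball in balls)
    {
        // Restart the sum whenever a new match or innings begins
        if (ball.MatchId != currentMatch || ball.Innings != currentInnings)
        {
            currentMatch = ball.MatchId;
            currentInnings = ball.Innings;
            runningTotal = 0;
        }

        runningTotal += ball.RunsOffBat + ball.Extras; // Score = runs_off_bat + extras
        totals.Add(runningTotal);
    }

    return totals;
}

var sample = new List<(int MatchId, int Innings, int RunsOffBat, int Extras)>
{
    (1001349, 1, 0, 0), (1001349, 1, 0, 0), (1001349, 1, 1, 0),
    (1001349, 1, 2, 0), (1001349, 1, 0, 0), (1001349, 1, 3, 0),
    (1001349, 2, 4, 1), (1001349, 2, 0, 0) // made-up second-innings rows
};

List<int> totals = RunningTotals(sample);
Console.WriteLine(string.Join(", ", totals)); // prints "0, 0, 1, 3, 3, 6, 5, 5"
```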

In [101]:
// Add a 'score' column holding the runs scored on each ball
sixOverDataFrame.Columns.Add(new PrimitiveDataFrameColumn<int>(CSV_COLUMN_SCORE, sixOverDataFrame.Rows.Count));
sixOverDataFrame.Columns[CSV_COLUMN_SCORE] = sixOverDataFrame.Columns[CSV_COLUMN_RUNS_OFF_BAT] + sixOverDataFrame.Columns[CSV_COLUMN_EXTRAS];

In [102]:
sixOverDataFrame.Info()

index,Info,match_id,season,start_date,venue,innings,ball,batting_team,bowling_team,striker,non_striker,bowler,runs_off_bat,extras,wides,noballs,byes,legbyes,penalty,wicket_type,player_dismissed,other_wicket_type,other_player_dismissed,score
0,DataType,System.Single,System.String,System.DateTime,System.String,System.Single,System.Single,System.String,System.String,System.String,System.String,System.String,System.Single,System.Single,System.String,System.String,System.String,System.String,System.String,System.String,System.String,System.String,System.String,System.Single
1,Length (excluding null values),75244,75244,75244,75244,75244,75244,75244,75244,75244,75244,75244,75244,75244,75244,75244,75244,75244,75244,75244,75244,75244,75244,75244


In [103]:
sixOverDataFrame.Head(20)

index,match_id,season,start_date,venue,innings,ball,batting_team,bowling_team,striker,non_striker,bowler,runs_off_bat,extras,wides,noballs,byes,legbyes,penalty,wicket_type,player_dismissed,other_wicket_type,other_player_dismissed,score
0,1001349,2016/17,2017-02-17 00:00:00Z,Melbourne Cricket Ground,1,0.1,Australia,Sri Lanka,AJ Finch,M Klinger,SL Malinga,0,0,,,,,,,,,,0
1,1001349,2016/17,2017-02-17 00:00:00Z,Melbourne Cricket Ground,1,0.2,Australia,Sri Lanka,AJ Finch,M Klinger,SL Malinga,0,0,,,,,,,,,,0
2,1001349,2016/17,2017-02-17 00:00:00Z,Melbourne Cricket Ground,1,0.3,Australia,Sri Lanka,AJ Finch,M Klinger,SL Malinga,1,0,,,,,,,,,,1
3,1001349,2016/17,2017-02-17 00:00:00Z,Melbourne Cricket Ground,1,0.4,Australia,Sri Lanka,M Klinger,AJ Finch,SL Malinga,2,0,,,,,,,,,,2
4,1001349,2016/17,2017-02-17 00:00:00Z,Melbourne Cricket Ground,1,0.5,Australia,Sri Lanka,M Klinger,AJ Finch,SL Malinga,0,0,,,,,,,,,,0
5,1001349,2016/17,2017-02-17 00:00:00Z,Melbourne Cricket Ground,1,0.6,Australia,Sri Lanka,M Klinger,AJ Finch,SL Malinga,3,0,,,,,,,,,,3
6,1001349,2016/17,2017-02-17 00:00:00Z,Melbourne Cricket Ground,1,1.1,Australia,Sri Lanka,M Klinger,AJ Finch,KMDN Kulasekara,0,0,,,,,,,,,,0
7,1001349,2016/17,2017-02-17 00:00:00Z,Melbourne Cricket Ground,1,1.2,Australia,Sri Lanka,M Klinger,AJ Finch,KMDN Kulasekara,1,0,,,,,,,,,,1
8,1001349,2016/17,2017-02-17 00:00:00Z,Melbourne Cricket Ground,1,1.3,Australia,Sri Lanka,AJ Finch,M Klinger,KMDN Kulasekara,0,0,,,,,,,,,,0
9,1001349,2016/17,2017-02-17 00:00:00Z,Melbourne Cricket Ground,1,1.4,Australia,Sri Lanka,AJ Finch,M Klinger,KMDN Kulasekara,0,0,,,,,,,,,,0


#### Total Score

In [104]:
public void AddTotalScorePerBall(Microsoft.Data.Analysis.DataFrame df)
{
    // Calculate total_score per ball as a running sum within each (match_id, innings) pair
    df.Columns.Add(new PrimitiveDataFrameColumn<Single>(CSV_COLUMN_TOTAL_SCORE, df.Rows.Count));
    int scoreIndex = df.Columns.Count - 2;      // 'score' is the second-to-last column
    int totalScoreIndex = df.Columns.Count - 1; // 'total_score' was just appended

    Single previousMatchId = -1;
    Single previousInning = -1;
    Single runningTotal = 0;

    foreach (var dfRow in df.Rows)
    {
        var matchId = (Single)dfRow[0];
        var inning = (Single)dfRow[4];
        var score = (Single)dfRow[scoreIndex];

        // Restart the running total whenever a new match or innings begins.
        // This also avoids counting the very first ball of the dataset twice.
        if (matchId != previousMatchId || inning != previousInning)
        {
            runningTotal = 0;
            previousMatchId = matchId;
            previousInning = inning;
        }

        runningTotal += score;

        // Total Score
        dfRow[totalScoreIndex] = runningTotal;
    }
}

In [105]:
AddTotalScorePerBall(sixOverDataFrame);

In [106]:
display(sixOverDataFrame)

index,match_id,season,start_date,venue,innings,ball,batting_team,bowling_team,striker,non_striker,bowler,runs_off_bat,extras,wides,noballs,byes,legbyes,penalty,wicket_type,player_dismissed,other_wicket_type,other_player_dismissed,score,total_score
0,1001349,2016/17,2017-02-17 00:00:00Z,Melbourne Cricket Ground,1,0.1,Australia,Sri Lanka,AJ Finch,M Klinger,SL Malinga,0,0,,,,,,,,,,0,0
1,1001349,2016/17,2017-02-17 00:00:00Z,Melbourne Cricket Ground,1,0.2,Australia,Sri Lanka,AJ Finch,M Klinger,SL Malinga,0,0,,,,,,,,,,0,0
2,1001349,2016/17,2017-02-17 00:00:00Z,Melbourne Cricket Ground,1,0.3,Australia,Sri Lanka,AJ Finch,M Klinger,SL Malinga,1,0,,,,,,,,,,1,1
3,1001349,2016/17,2017-02-17 00:00:00Z,Melbourne Cricket Ground,1,0.4,Australia,Sri Lanka,M Klinger,AJ Finch,SL Malinga,2,0,,,,,,,,,,2,3
4,1001349,2016/17,2017-02-17 00:00:00Z,Melbourne Cricket Ground,1,0.5,Australia,Sri Lanka,M Klinger,AJ Finch,SL Malinga,0,0,,,,,,,,,,0,3
5,1001349,2016/17,2017-02-17 00:00:00Z,Melbourne Cricket Ground,1,0.6,Australia,Sri Lanka,M Klinger,AJ Finch,SL Malinga,3,0,,,,,,,,,,3,6
6,1001349,2016/17,2017-02-17 00:00:00Z,Melbourne Cricket Ground,1,1.1,Australia,Sri Lanka,M Klinger,AJ Finch,KMDN Kulasekara,0,0,,,,,,,,,,0,6
7,1001349,2016/17,2017-02-17 00:00:00Z,Melbourne Cricket Ground,1,1.2,Australia,Sri Lanka,M Klinger,AJ Finch,KMDN Kulasekara,1,0,,,,,,,,,,1,7
8,1001349,2016/17,2017-02-17 00:00:00Z,Melbourne Cricket Ground,1,1.3,Australia,Sri Lanka,AJ Finch,M Klinger,KMDN Kulasekara,0,0,,,,,,,,,,0,7
9,1001349,2016/17,2017-02-17 00:00:00Z,Melbourne Cricket Ground,1,1.4,Australia,Sri Lanka,AJ Finch,M Klinger,KMDN Kulasekara,0,0,,,,,,,,,,0,7


### Feature Selection

The dataset now has 24 columns, and not all of them are relevant for making predictions. Several strategies are available for feature selection, such as a correlation matrix or exploratory plots (for example with seaborn). To keep things simple, I chose a handful of features after trying different combinations. The selected features are as follows.

**Input Features**
- venue
- innings
- ball
- batting_team
- bowling_team
- striker
- non_striker
- bowler

**Output**
- total_score

In [107]:
// Remove extra features
sixOverDataFrame.Columns.Remove("match_id");
sixOverDataFrame.Columns.Remove("season");
sixOverDataFrame.Columns.Remove("start_date");
sixOverDataFrame.Columns.Remove("runs_off_bat");
sixOverDataFrame.Columns.Remove("extras");
sixOverDataFrame.Columns.Remove("wides");
sixOverDataFrame.Columns.Remove("noballs");
sixOverDataFrame.Columns.Remove("byes");
sixOverDataFrame.Columns.Remove("legbyes");
sixOverDataFrame.Columns.Remove("penalty");
sixOverDataFrame.Columns.Remove("player_dismissed");
sixOverDataFrame.Columns.Remove("other_wicket_type");
sixOverDataFrame.Columns.Remove("other_player_dismissed");
sixOverDataFrame.Columns.Remove("score");
sixOverDataFrame.Columns.Remove("wicket_type");

In [108]:
display(sixOverDataFrame);

index,venue,innings,ball,batting_team,bowling_team,striker,non_striker,bowler,total_score
0,Melbourne Cricket Ground,1,0.1,Australia,Sri Lanka,AJ Finch,M Klinger,SL Malinga,0
1,Melbourne Cricket Ground,1,0.2,Australia,Sri Lanka,AJ Finch,M Klinger,SL Malinga,0
2,Melbourne Cricket Ground,1,0.3,Australia,Sri Lanka,AJ Finch,M Klinger,SL Malinga,1
3,Melbourne Cricket Ground,1,0.4,Australia,Sri Lanka,M Klinger,AJ Finch,SL Malinga,3
4,Melbourne Cricket Ground,1,0.5,Australia,Sri Lanka,M Klinger,AJ Finch,SL Malinga,3
5,Melbourne Cricket Ground,1,0.6,Australia,Sri Lanka,M Klinger,AJ Finch,SL Malinga,6
6,Melbourne Cricket Ground,1,1.1,Australia,Sri Lanka,M Klinger,AJ Finch,KMDN Kulasekara,6
7,Melbourne Cricket Ground,1,1.2,Australia,Sri Lanka,M Klinger,AJ Finch,KMDN Kulasekara,7
8,Melbourne Cricket Ground,1,1.3,Australia,Sri Lanka,AJ Finch,M Klinger,KMDN Kulasekara,7
9,Melbourne Cricket Ground,1,1.4,Australia,Sri Lanka,AJ Finch,M Klinger,KMDN Kulasekara,7


In [109]:
sixOverDataFrame.Info()

index,Info,venue,innings,ball,batting_team,bowling_team,striker,non_striker,bowler,total_score
0,DataType,System.String,System.Single,System.Single,System.String,System.String,System.String,System.String,System.String,System.Single
1,Length (excluding null values),75244,75244,75244,75244,75244,75244,75244,75244,75244


In [110]:
sixOverDataFrame.Description()

index,Description,innings,ball,total_score
0,Length (excluding null values),75244.0,75244.0,75244.0
1,Max,4.0,5.9,98.0
2,Min,1.0,0.1,0.0
3,Mean,1.5008107,2.8433447,22.010925


**Note** To make predictions in this walkthrough, ML.NET needs a CSV file that its APIs can load. I did not find a direct way to feed the DataFrame to ML.NET for prediction, nor an API on the DataFrame to save it to a CSV file.

I discovered a library, 'Daany', which allows creating, manipulating and saving a DataFrame to a CSV file.

<img src="assets\daany-dotnet.png" alt="Daany dotnet" style="zoom:60%;margin:auto;">


Link: https://github.com/bhrnjica/daany

I'll use this library to recreate the DataFrame, apply the same modifications as above and save the result as a CSV file to be used for prediction. I found it to be very flexible. As this is somewhat repetitive, you can skip to the Training section for model creation and training.


### Data Manipulation
- Take records till 6 overs
- Remove missing values
- Add 'score' column representing runs per ball
- Add 'total_score' column to get the total score till the current ball

**Note** 
Some cell values in the CSV dataset contain a comma ',', for example the venue "Simonds Stadium, South Geelong". As a result, the ML.NET LoadFromTextFile(...) API loads such rows incorrectly and treats the value as two separate values instead of one. To overcome this, I have written a helper function, CreateCsvAndReplaceSeparatorInCells(...), that replaces the comma with a specified character.

In [154]:
/// <summary>
/// Replace a character in a cell of csv with a defined separator
/// </summary>
/// <param name="inputFile">Name of input file</param>
/// <param name="outputFile">Name of output file</param>
/// <param name="separator">Separator such as comma</param>
/// <param name="separatorReplacement">Replacement character such as underscore</param>
private static void CreateCsvAndReplaceSeparatorInCells(string inputFile, string outputFile, char separator, char separatorReplacement)
{
    var culture = CultureInfo.InvariantCulture;
    using var reader = new StreamReader(inputFile);
    using var csvIn = new CsvReader(reader, new CsvConfiguration(culture));
    using var recordsIn = new CsvDataReader(csvIn);
    using var writer = new StreamWriter(outputFile);
    using var outCsv = new CsvWriter(writer, culture);

    // Write Header
    csvIn.ReadHeader();
    var headers = csvIn.HeaderRecord;
    foreach (var header in headers)
    {
        outCsv.WriteField(header.Replace(separator, separatorReplacement));
    }
    outCsv.NextRecord();

    // Write rows
    while (recordsIn.Read())
    {
        var columns = recordsIn.FieldCount;
        for (var index = 0; index < columns; index++)
        {
            var cellValue = recordsIn.GetString(index);
            outCsv.WriteField(cellValue.Replace(separator, separatorReplacement));
        }
        outCsv.NextRecord();
    }
}

In [155]:
CreateCsvAndReplaceSeparatorInCells(DATASET_MERGED_CSV,  DATASET_ALL_CSV, ',', '_');
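
To illustrate the comma problem described in the note above, the sketch below compares a naive split on ',' with a small quote-aware split on a single line. It uses only the base class library; `SplitCsvLine` is a hypothetical, deliberately minimal helper (no escaped quotes, no multi-line fields), and the sample row is made up apart from the venue name taken from the note.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Minimal quote-aware splitter for one CSV line.
// Assumption: fields contain no escaped quotes and no line breaks.
static List<string> SplitCsvLine(string line)
{
    var fields = new List<string>();
    var current = new StringBuilder();
    bool insideQuotes = false;

    foreach (char c in line)
    {
        if (c == '"')
        {
            insideQuotes = !insideQuotes; // toggle on every quote character
        }
        else if (c == ',' && !insideQuotes)
        {
            fields.Add(current.ToString()); // a separator outside quotes ends the field
            current.Clear();
        }
        else
        {
            current.Append(c);
        }
    }

    fields.Add(current.ToString());
    return fields;
}

string row = "1001349,\"Simonds Stadium, South Geelong\",1,0.1";

int naiveCount = row.Split(',').Length;      // the venue is cut in two
List<string> parsed = SplitCsvLine(row);     // the venue stays one field
string sanitizedVenue = parsed[1].Replace(',', '_');

Console.WriteLine($"naive fields: {naiveCount}, quote-aware fields: {parsed.Count}");
Console.WriteLine(sanitizedVenue);
```

Once the comma inside the cell is replaced, a plain comma-separated loader sees the same number of fields in every row.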

In [156]:
// Load CSV
Daany.DataFrame cricketDaanyDataFrame = Daany.DataFrame.FromCsv(DATASET_ALL_CSV);
display(cricketDaanyDataFrame.RowCount());    

In [157]:
// Keep only the columns that have no missing values
var missingValues = cricketDaanyDataFrame.MissingValues();

List<string> nonNullColumns = new List<string>();
foreach (var dfColumn in cricketDaanyDataFrame.Columns)
{
    if (!missingValues.ContainsKey(dfColumn))
    {
        display(dfColumn);
        nonNullColumns.Add(dfColumn);
    }
}

cricketDaanyDataFrame = cricketDaanyDataFrame[nonNullColumns.ToArray()];
display(cricketDaanyDataFrame);

match_id

season

start_date

venue

innings

ball

batting_team

bowling_team

striker

non_striker

bowler

runs_off_bat

extras

Unhandled exception: System.OutOfMemoryException: Insufficient memory to continue the execution of the program.

In [158]:
// Filter dataset to include till 6 overs and discard beyond that
var sixBallDaanyDataFrame = cricketDaanyDataFrame.Filter(CSV_COLUMN_BALL, OVER_THRESHOLD, FilterOperator.LessOrEqual);

In [116]:
// Add 'score' column
sixBallDaanyDataFrame.AddCalculatedColumn(CSV_COLUMN_SCORE, (r, i) =>
{
    return Convert.ToInt32(r[CSV_COLUMN_RUNS_OFF_BAT]) + Convert.ToInt32(r[CSV_COLUMN_EXTRAS]);
});

#### Calculating Total Score
There is no direct way to calculate the total score, as it has to be a cumulative sum restricted to the first 6 overs of each innings in the dataset. I have written a function to calculate the total score.

In [117]:
/// <summary>
/// Adds a total_score column to the DataFrame by calculating the cumulative sum
/// </summary>
/// <param name="df">DataFrame to which new column is added</param>
public void AddTotalScorePerBallDaany(Daany.DataFrame df)
{
    // Uses the Daany DataFrame library: https://github.com/bhrnjica/daany
    // Add the total_score column with a placeholder value of 0;
    // the loop below fills in the running total for each ball
    df.AddCalculatedColumn(CSV_COLUMN_TOTAL_SCORE, (r, i) =>
    {
        var response = Convert.ToSingle(r[CSV_COLUMN_SCORE]);
        return 0;
    });

    int scoreColumnIndex = df.Columns.Count - 2;      // score
    int totalScoreColumnIndex = df.Columns.Count - 1; // total_score (just added)

    int previousMatchId = -1;
    int previousInning = -1;
    float runningTotal = 0;
    int rowIndex = 0;

    foreach (var dfRow in df.GetRowEnumerator())
    {
        var matchId = (int)dfRow[0];
        var inning = (int)dfRow[4];
        var score = Convert.ToSingle(dfRow[scoreColumnIndex]);

        // A new match or a new innings restarts the running total.
        // This also avoids counting the very first ball twice.
        if (matchId != previousMatchId || inning != previousInning)
        {
            runningTotal = 0;
            previousMatchId = matchId;
            previousInning = inning;
        }

        runningTotal += score;
        df[rowIndex, totalScoreColumnIndex] = runningTotal;

        rowIndex++;
    }
}

In [118]:
AddTotalScorePerBallDaany(sixBallDaanyDataFrame);
display(sixBallDaanyDataFrame);

| Columns | ColTypes | Index | Shape |
| --- | --- | --- | --- |
| [ match_id, season, start_date, venue, innings, ball, batting_team, bowling_team, striker, non_striker, bowler, runs_off_bat, extras, score, total_score ] | [ I32, STR, DT, STR, I32, F32, STR, STR, STR, STR, STR, I32, I32, I32, I32 ] | [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19 ... (75224 more) ] | ( 75244, 15 ) |


In [119]:
// Keep only the columns needed for training
sixBallDaanyDataFrame = sixBallDaanyDataFrame[CSV_COLUMN_VENUE, CSV_COLUMN_INNINGS, CSV_COLUMN_BALL, CSV_COLUMN_BATTING_TEAM, CSV_COLUMN_BOWLING_TEAM, CSV_COLUMN_STRIKER, CSV_COLUMN_NON_STRIKER, CSV_COLUMN_BOWLER, CSV_COLUMN_TOTAL_SCORE];
display(sixBallDaanyDataFrame);

| Columns | ColTypes | Index | Shape |
| --- | --- | --- | --- |
| [ venue, innings, ball, batting_team, bowling_team, striker, non_striker, bowler, total_score ] | [ STR, I32, F32, STR, STR, STR, STR, STR, I32 ] | [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19 ... (75224 more) ] | ( 75244, 9 ) |


In [120]:
// Save DataFrame to CSV
Daany.DataFrame.ToCsv(DATASET_CLEANED_CSV, sixBallDaanyDataFrame);

### Final Dataset Columns
- Venue
- Innings
- Ball
- Batting_Team
- Bowling_Team
- Striker
- Non_Striker
- Bowler
- Total_Score

## **Training**

### Data Classes

We need a few data classes to map the columns of our dataset.

#### Match

In [145]:
/// <summary>
/// Represents the input features involved in prediction
/// </summary>
public class Match
{
    /// <summary>
    /// Match venue
    /// </summary>
    [LoadColumn(0)]
    public string Venue { get; set; }

    /// <summary>
    /// Innings within a match
    /// </summary>
    [LoadColumn(1)]
    public float Inning { get; set; }

    /// <summary>
    /// Current ball being bowled, in over.ball notation (for example 3.4)
    /// </summary>
    [LoadColumn(2)]
    public float Ball { get; set; }

    /// <summary>
    /// Name of the batting team
    /// </summary>
    [LoadColumn(3)]
    public string BattingTeam { get; set; }

    /// <summary>
    /// Name of the bowling team
    /// </summary>
    [LoadColumn(4)] 
    public string BowlingTeam { get; set; }

    /// <summary>
    /// Batsman on strike
    /// </summary>
    [LoadColumn(5)]
    public string Striker { get; set; }

    /// <summary>
    /// Non striker batsman
    /// </summary>
    [LoadColumn(6)]
    public string NonStriker { get; set; }

    /// <summary>
    /// Current bowler
    /// </summary>
    [LoadColumn(7)]
    public string Bowler { get; set; }

    /// <summary>
    /// Total score till the current ball
    /// </summary>
    [LoadColumn(8)]
    public float TotalScore { get; set; }
    
    public override string ToString()
    {
        var sb = new StringBuilder();
        sb.Append($"Venue: {Venue}");
        sb.Append($"\nBatting Team: {BattingTeam}");
        sb.Append($"\nBowling Team: {BowlingTeam}");
        sb.Append($"\nInning: {Inning}");
        sb.Append($"\nBall: {Ball}");
        sb.Append($"\nStriker: {Striker}");
        sb.Append($"\nNon-Striker: {NonStriker}");
        sb.Append($"\nBowler: {Bowler}");

        return sb.ToString();
    }
}

#### MatchScorePrediction

In [126]:
/// <summary>
/// Holds the score predicted by the model
/// </summary>
public class MatchScorePrediction
{
    /// <summary>
    /// Total runs scored by the team at the specified ball
    /// </summary>
    [ColumnName("Score")]
    public float TotalScore { get; set; }
}

### Load Dataset

In [130]:
// Load the dataset
var mlContext = new MLContext(seed:1);
IDataView data = mlContext.Data.LoadFromTextFile<Match>(
    path:DATASET_CLEANED_CSV,
    hasHeader: true,
    separatorChar: ',');

In [131]:
// Split dataset
var trainTestData = mlContext.Data.TrainTestSplit(data, 0.2); // Training/Test : 80/20

In [135]:
// Transform
var dataProcessPipeline = mlContext.Transforms.CopyColumns(outputColumnName: "Label", inputColumnName: nameof(Match.TotalScore))

    .Append(mlContext.Transforms.Categorical.OneHotEncoding("VenueEncoded", nameof(Match.Venue)))
    .Append(mlContext.Transforms.Categorical.OneHotEncoding("BattingTeamEncoded", nameof(Match.BattingTeam)))
    .Append(mlContext.Transforms.Categorical.OneHotEncoding("BowlingTeamEncoded", nameof(Match.BowlingTeam)))
    .Append(mlContext.Transforms.Categorical.OneHotEncoding("StrikerEncoded", nameof(Match.Striker)))
    .Append(mlContext.Transforms.Categorical.OneHotEncoding("NonStrikerEncoded", nameof(Match.NonStriker)))
    .Append(mlContext.Transforms.Categorical.OneHotEncoding("BowlerEncoded", nameof(Match.Bowler)))

    .Append(mlContext.Transforms.Concatenate("Features",
        "VenueEncoded",
        nameof(Match.Inning),
        nameof(Match.Ball),
        "BattingTeamEncoded",
        "BowlingTeamEncoded",
        "StrikerEncoded",
        "NonStrikerEncoded",
        "BowlerEncoded"
    ));
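
The pipeline above one-hot encodes every string column before concatenating the features. As a rough illustration of the idea, the sketch below builds an indicator vector by hand with plain C#. The team names and the `BuildVocabulary` and `OneHot` helpers are hypothetical, and this is not how ML.NET implements `OneHotEncoding` internally; it only shows the shape of the result.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical team names, used only for illustration
var battingTeams = new[] { "India", "Australia", "India", "England" };

// Assigns each distinct value an index in order of first appearance
static Dictionary<string, int> BuildVocabulary(IEnumerable<string> values)
{
    var vocabulary = new Dictionary<string, int>();
    foreach (var value in values)
    {
        if (!vocabulary.ContainsKey(value))
        {
            vocabulary[value] = vocabulary.Count;
        }
    }
    return vocabulary;
}

// Returns a vector with a single 1 at the value's index;
// a value that is not in the vocabulary yields all zeros
static float[] OneHot(string value, IReadOnlyDictionary<string, int> vocabulary)
{
    var vector = new float[vocabulary.Count];
    if (vocabulary.TryGetValue(value, out var index))
    {
        vector[index] = 1f;
    }
    return vector;
}

var teamVocabulary = BuildVocabulary(battingTeams);
foreach (var team in battingTeams)
{
    Console.WriteLine($"{team} -> [{string.Join(", ", OneHot(team, teamVocabulary))}]");
}
```

With three distinct teams, "India" maps to `[1, 0, 0]`, "Australia" to `[0, 1, 0]` and "England" to `[0, 0, 1]`. Columns with many distinct values, such as striker or bowler, therefore produce wide and sparse feature vectors.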





In [138]:
// Train
var trainingPipeline = dataProcessPipeline.Append(mlContext.Regression.Trainers.FastTree(labelColumnName: "Label", featureColumnName: "Features"));
var trainedModel = trainingPipeline.Fit(trainTestData.TrainSet);

## **Evaluate**

In [140]:
var predictions = trainedModel.Transform(trainTestData.TestSet);
var metrics = mlContext.Regression.Evaluate(predictions, "Label", "Score");
display($"*************************************************");
display($"*       Model quality metrics evaluation         ");
display($"*------------------------------------------------");
display($"*       RSquared Score:      {metrics.RSquared:0.##}");
display($"*       Root Mean Squared Error:      {metrics.RootMeanSquaredError:#.##}");

*************************************************

*       Model quality metrics evaluation         

*------------------------------------------------

*       RSquared Score:      0.82

*       Root Mean Squared Error:      6.73
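
For readers new to these two metrics, the sketch below computes them by hand using the usual textbook definitions: RMSE is the square root of the mean squared error, and R² is one minus the ratio of the residual sum of squares to the total sum of squares. The sample values and helper names are hypothetical, and ML.NET's own implementation may differ in detail.

```csharp
using System;
using System.Linq;

// Square root of the average squared difference between actual and predicted values
static double RootMeanSquaredError(double[] actual, double[] predicted)
{
    double sumSquaredError = actual.Zip(predicted, (a, p) => (a - p) * (a - p)).Sum();
    return Math.Sqrt(sumSquaredError / actual.Length);
}

// 1 - (residual sum of squares / total sum of squares)
static double RSquared(double[] actual, double[] predicted)
{
    double mean = actual.Average();
    double residualSumOfSquares = actual.Zip(predicted, (a, p) => (a - p) * (a - p)).Sum();
    double totalSumOfSquares = actual.Sum(a => (a - mean) * (a - mean));
    return 1.0 - residualSumOfSquares / totalSumOfSquares;
}

// Hypothetical scores, used only for illustration
var actualScores = new double[] { 10, 20, 30, 40 };
var predictedScores = new double[] { 12, 18, 33, 37 };

Console.WriteLine($"RMSE: {RootMeanSquaredError(actualScores, predictedScores):0.##}");
Console.WriteLine($"R2:   {RSquared(actualScores, predictedScores):0.##}");
```

For these sample values the RMSE is about 2.55 and R² is about 0.95. Read the same way, the notebook's RMSE of 6.73 means the predicted score is typically off by roughly 7 runs on the test set.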

## **Save Model**

In [142]:
// Save
var savedPath = Path.Combine(Directory.GetCurrentDirectory(), MODEL_FILE);
mlContext.Model.Save(trainedModel, trainTestData.TrainSet.Schema, savedPath);
display($"The model is saved to {savedPath}");

The model is saved to D:\Praveen\sourcecontrol\github\praveenraghuvanshi\tech-sessions\drafts\Sport Analytics\model.zip

## **Prediction**

In [152]:
// Predict
display("*********** Predict...");
var predictionEngine = mlContext.Model.CreatePredictionEngine<Match, MatchScorePrediction>(trainedModel);
var match = new Match
{
    Ball = 3.4f,
    BattingTeam = "India",
    BowlingTeam = "New Zealand",
    Bowler = "A Nehra",
    Inning = 1,
    Striker = "V Kohli",
    NonStriker = "Yuvraj Singh",
    Venue = "Vidarbha Cricket Association Stadium_ Jamtha"
};

// make the prediction
var prediction = predictionEngine.Predict(match);

// report the results
display($"Match Info:\n\n{match} ");
display($"\n^^^^^^ Prediction:  {prediction.TotalScore} ");
display($"**********************************************************************");
display($"Predicted score: {prediction.TotalScore:0.####}, actual score: 26");
display($"**********************************************************************");

*********** Predict...

Match Info:

Venue: Vidarbha Cricket Association Stadium_ Jamtha
Batting Team: India
Bowling Team: New Zealand
Inning: 1
Ball: 3.4
Striker: V Kohli
Non-Striker: Yuvraj Singh
Bowler: A Nehra 


^^^^^^ Prediction:  24.15512 

**********************************************************************

Predicted score: 24.1551, actual score: 26

**********************************************************************

## Problem statement
- EDA
  - Data Cleaning
  - Analysis of Dataset
  - Visualizations
- Prediction
  - Algorithm (Linear Regression)
  - Parameter Tuning
  - Better algorithm
  - Accuracy/Loss
  - Model Builder
 