# Tech Stack Migrations Analysis <a id='back'></a> 

* [Introduction](#intro)
* [1 Imports, Load, and Data Overview](#data_over)
    * [1.1 Overview Conclusion](#over_conc)
* [2 Migrations Analysis](#analysis)
    * [2.1 Analysis Conclusion](#conc_a)
* [3 Final Conclusion](#conc)

# Introduction <a id='intro'></a>

In this exploratory data analysis, we will examine our data and note general trends.

Additionally, we will test one wild-guess hypothesis: most tech companies are moving to golang.

## Imports, Load, and Data Overview <a id='data_over'></a>

In [1]:
# importing pandas, a general data-management library
import pandas as pd

# importing numpy, a general numerical computing library
import numpy as np

# importing scipy, a statistical analysis library
from scipy import stats as st

# importing plotly express, a quick-and-dirty graph plotting library
import plotly.express as px

In [2]:
# Load the dataset into a dataframe
migrations = pd.read_csv('migrations.csv')

# Display general information about the dataset 
migrations.info()
display(migrations.describe())
display(migrations.sample(n=5, random_state=1))

# check for duplicates
migrations_dup = migrations.duplicated().sum()
print('\nFull duplicate rows:',migrations_dup)
# check duplicate company names
migrations_dup = migrations['company'].str.lower().duplicated().sum()
print('\nDuplicate company names:',migrations_dup)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 237 entries, 0 to 236
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   company  237 non-null    object
 1   url      237 non-null    object
 2   year     237 non-null    int64 
 3   from     237 non-null    object
 4   to       237 non-null    object
dtypes: int64(1), object(4)
memory usage: 9.4+ KB


|       | year        |
|-------|-------------|
| count | 237.0       |
| mean  | 2018.345992 |
| std   | 3.217819    |
| min   | 2005.0      |
| 25%   | 2017.0      |
| 50%   | 2019.0      |
| 75%   | 2021.0      |
| max   | 2023.0      |


|     | company        | url                                               | year | from          | to          |
|-----|----------------|---------------------------------------------------|------|---------------|-------------|
| 201 | Etsy           | https://codeascraft.com/2021/11/08/etsys-journ... | 2021 | Javascript    | Typescript  |
| 130 | PayPal         | https://go.dev/solutions/paypal/                  | 2020 | C++           | Golang      |
| 88  | Fanatics       | https://www.singlestore.com/blog/how-fanatics-... | 2018 | ElasticSearch | SingleStore |
| 95  | 2FintechGiants | https://www.youtube.com/watch?v=IG1E7O1rl-s       | 2019 | Oracle        | CockroachDB |
| 211 | WeWatch        | https://jerseyfonseca.com/blogs/mongodb-to-pos... | 2021 | MongoDB       | PostgreSQL  |



Full duplicate rows: 0

Duplicate company names: 28


## Overview Conclusion <a id='over_conc'></a>

Our dataset of tech companies' stack migrations has the following columns:
1. company - the name of the company which migrated
2. url - the url of the news source detailing the migration
3. year - the year of the migration
4. from - the stack the company migrated from
5. to - the stack the company migrated to

The dataset contains no fully duplicated rows, but 28 rows repeat a company name that appears earlier, meaning some companies migrated stacks more than once. Since each row is a distinct migration, this has no bearing on our analysis or our hypothesis.
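To see which companies account for the repeated names, a small helper along these lines could be used. This is only a sketch: `repeated_companies` is a hypothetical helper name, and the toy rows below are made up for illustration and are not taken from `migrations.csv`.

```python
import pandas as pd

# Hypothetical toy rows standing in for migrations.csv; only the column
# names match the real dataset.
toy = pd.DataFrame({
    'company': ['Acme', 'acme', 'Globex', 'Initech'],
    'from': ['MySQL', 'Ruby', 'PHP', 'Oracle'],
    'to': ['TiDB', 'Golang', 'Golang', 'PostgreSQL'],
})

def repeated_companies(df):
    """Return migration counts for companies that appear more than once."""
    # Lower-case the names so 'Acme' and 'acme' count as one company,
    # mirroring the .str.lower() duplicate check in the cell above.
    counts = df['company'].str.lower().value_counts()
    return counts[counts > 1]

repeats = repeated_companies(toy)
print(repeats)
```

On the real data, the same call would be `repeated_companies(migrations)`.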

[Back to Contents](#back)

## Migrations Analysis  <a id='analysis'></a>

Now we will use the features of plotly express to perform a visually-assisted analysis.

In [14]:
# First, add a count column with a val of 1 to assist counting the total num of migrations
migrations['count'] = 1

In [15]:
# plot the distribution of migrations by year
fig = px.histogram(migrations, x='year', y='count', histfunc='count')
fig.show()

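We can see that the distribution of stack migrations in our dataset (of note: this describes our dataset, not the industry as a whole) is roughly bell-shaped but skewed left, with a long tail reaching back to 2005, a mean around 2018, and a median of 2019.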
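The skew can also be checked numerically rather than by eye. The sketch below uses pandas' `Series.skew()`, where a negative value indicates a longer left tail. `year_shape` is a hypothetical helper, and the toy years are invented for illustration, not drawn from the dataset.

```python
import pandas as pd

def year_shape(years):
    """Summarise a sequence of years: mean, median, and sample skewness."""
    s = pd.Series(years, dtype='float64')
    return {'mean': s.mean(), 'median': s.median(), 'skew': s.skew()}

# Hypothetical toy values (not the real data): a few early years form a
# long left tail, pulling the mean below the median.
toy_years = [2005, 2012, 2016, 2018, 2019, 2019, 2020, 2021, 2021, 2022]
shape = year_shape(toy_years)
print(shape)
```

On the real data this would be `year_shape(migrations['year'])`.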

In [35]:
# plot total migrations per from-stack, colored by year
fig = px.bar(migrations, x='from', y='count', color='year')
fig.show()

# print the top 20 stacks migrated-from
print(migrations.groupby('from')['count'].value_counts().sort_values(ascending=False).head(20))

from           count
MySQL          1        31
MongoDB        1        15
Ruby           1        14
Python         1        11
Cassandra      1         9
PHP            1         8
PostgreSQL     1         7
NodeJS         1         7
React          1         6
Oracle         1         6
Kubernetes     1         6
ElasticSearch  1         5
AWS            1         4
MSSQLServer    1         4
C++            1         3
RedShift       1         3
Scala          1         3
InfluxDB       1         2
Perl           1         2
Java           1         2
Name: count, dtype: int64


We can see the top 5 stacks most frequently migrated-from are MySQL, MongoDB, Ruby, Python, and Cassandra.
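The same tally can be produced without the helper `count` column by calling `value_counts()` on the column directly, which sorts by frequency in descending order. A minimal sketch, where `top_stacks` is a hypothetical helper and the toy frame is illustrative only:

```python
import pandas as pd

def top_stacks(df, column, n=20):
    """Count rows per stack in `column`, most frequent first."""
    # value_counts() already sorts from most to least frequent.
    return df[column].value_counts().head(n)

# Hypothetical toy rows, not taken from migrations.csv
toy = pd.DataFrame({
    'from': ['MySQL', 'Ruby', 'MySQL', 'Python', 'Ruby', 'MySQL'],
})
top = top_stacks(toy, 'from', n=2)
print(top)
```

On the real data, `top_stacks(migrations, 'from')` and `top_stacks(migrations, 'to')` would cover both directions.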

In [36]:
# plot total migrations per to-stack, colored by year
fig = px.bar(migrations, x='to', y='count', color='year')
fig.show()

# print the top 20 stacks migrated-to
print(migrations.groupby('to')['count'].value_counts().sort_values(ascending=False).head(20))

to               count
Golang           1        43
TiDB             1        31
SingleStore      1        23
PostgreSQL       1        12
Clickhouse       1        11
ScyllaDB         1        10
YugabyteDB       1         9
Rust             1         4
VictoriaMetrics  1         4
Dart             1         3
Svelte           1         3
Elixir           1         3
CockroachDB      1         3
Couchbase        1         3
NodeJS           1         3
Kotlin           1         2
MySQL            1         2
Hack             1         2
Nomad            1         2
Nomad/Consul     1         2
Name: count, dtype: int64


Additionally, we can see the stacks most commonly migrated-to are Golang, TiDB, SingleStore, PostgreSQL, Clickhouse, and ScyllaDB.

In [13]:
# plot a scatterplot of migrations between stacks, color-coded by year
fig = px.scatter(migrations, x='to', y='from', color='year')
fig.show()

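Here, we can see which from-stacks feed each to-stack, including the top 6 stacks migrated-to. Note that repeated from/to pairs are drawn on top of one another, so the plot shows which pairs occur rather than how often.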
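To put numbers on the from/to pairs, the rows can be grouped by both columns and counted. This is a sketch under the assumption that the frame has `from` and `to` columns as in our dataset; `pair_counts` is a hypothetical helper and the toy rows are invented for illustration.

```python
import pandas as pd

def pair_counts(df):
    """Count how often each (from, to) migration pair occurs."""
    return (
        df.groupby(['from', 'to'])
          .size()                          # rows per (from, to) pair
          .sort_values(ascending=False)    # most common pair first
          .rename('migrations')
          .reset_index()
    )

# Hypothetical toy rows, not taken from migrations.csv
toy = pd.DataFrame({
    'from': ['MySQL', 'MySQL', 'Ruby'],
    'to': ['TiDB', 'TiDB', 'Golang'],
})
pairs = pair_counts(toy)
print(pairs)
```

On the real data, `pair_counts(migrations).head(10)` would list the ten most common migration paths.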

## Analysis Conclusion <a id='conc_a'></a>

Based on the counts above, we can reasonably assume that there are two distinct paths of stack change:

1. A change in database choice, the most common choices being TiDB (31 migrations) and SingleStore (23).
2. A change in backend language, the most common choice being Golang (43 migrations) by a wide margin.

[Back to Contents](#back)

## Final Conclusion <a id='conc'></a>

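At first glance the wild guess is only moderately supported: Golang is the single most common migration target (43 of 237 migrations), but that is well short of a majority. However, there's more to it.

After a quick bit of digging, I discovered that TiDB is written in Go.

So if we count TiDB migrations (31) together with direct Golang migrations (43), Go is the most popular destination with 74 migrations, over triple the count of the next most popular target, SingleStore (23).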
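The arithmetic behind this comparison can be reproduced from the counts printed in the "migrated-to" output above. Treating TiDB migrations as moves to Go is the interpretive step taken in this conclusion, not something the dataset encodes; the variable names are illustrative.

```python
# Counts copied from the top-20 "migrated-to" output and the row count
# reported by migrations.info().
total_migrations = 237
to_counts = {'Golang': 43, 'TiDB': 31, 'SingleStore': 23}

# Share of all migrations that went directly to Golang
golang_share = to_counts['Golang'] / total_migrations

# Direct Golang migrations plus migrations to TiDB (written in Go)
go_based = to_counts['Golang'] + to_counts['TiDB']

# How many times larger that total is than the next target, SingleStore
ratio_vs_next = go_based / to_counts['SingleStore']

print(f"Golang share of all migrations: {golang_share:.1%}")
print(f"Golang + TiDB migrations: {go_based}")
print(f"Ratio to SingleStore: {ratio_vs_next:.2f}")
```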

[Back to Contents](#back)