Hi! 
Thanks for checking this out! This is my first attempt at building a Kernel with Markdown, so this notebook may occasionally be a little difficult to understand.

Since this is my first competition on Kaggle, I decided to write a very crude and simple analysis of the provided data for learning purposes:

First, I imported some libraries used for grabbing the data (pandas) and plotting the data (matplotlib):

In [9]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

train_data = pd.read_csv("../input/en_train.csv")
test_data = pd.read_csv("../input/en_test.csv")

Now, let's see the dimensions and a preview of the provided data:

In [10]:
print(train_data.shape)
print(test_data.shape)
print(train_data.head())
print(test_data.head())

We can see that we have a reasonable amount of data to train on, and a dataset that's easy to understand. 

Now, since we will probably have to convert each class differently (e.g. punctuation and numbers need different processing), we want to see how many classes there are:

In [11]:
class_types = train_data["class"].unique()
print(class_types)

Now, since we don't want to do any redundant computations, we want to see exactly what percentage of each class has different "before" and "after" values.

First, let's get all the data which satisfies (before != after):

In [12]:
#Get data which have different "before"'s and "after"'s:

class_type_dict = dict()
for i in class_types:
    class_data = train_data.loc[train_data["class"] == i]
    class_type_dict[i] = class_data.loc[class_data["before"] != class_data["after"]]
    
print(class_type_dict)

Now, we have a dictionary that contains all the changed data grouped by class.
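
As a side note, here is a hedged sketch of another way to build a similar dictionary: filter the changed rows once, then split them with `groupby`. The `toy` frame is made-up illustration data that only mimics the "class", "before" and "after" columns; it is not taken from the competition files. Unlike the loop above, classes with no changed rows get no key at all in this version.

```python
import pandas as pd

# Made-up rows that only mimic the "class"/"before"/"after" columns
toy = pd.DataFrame({
    "class":  ["PLAIN", "PLAIN", "DATE", "PUNCT", "LETTERS"],
    "before": ["the", "colour", "2017", ".", "UK"],
    "after":  ["the", "color", "twenty seventeen", ".", "u k"],
})

# Keep only the rows where "before" and "after" differ
changed = toy[toy["before"] != toy["after"]]

# One sub-DataFrame per class that has at least one changed row
changed_by_class = {name: group for name, group in changed.groupby("class")}

for name, group in changed_by_class.items():
    print(name, len(group))
```

On the real data, `toy` would simply be replaced by `train_data`.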

Next, let's plot the data to compare the amount of data that each class has:

In [13]:
import seaborn as sns #Colored Plotting library

group_class = train_data.groupby('class',as_index=False).count()

buf = sns.barplot(x=group_class['class'],y=group_class['before'])
buf.set(xlabel='Classes', ylabel='Total Data Count')

for x in buf.get_xticklabels():
    x.set_rotation(90)

As we can see, plain words make up the majority of the data, with punctuation at around a quarter of that!
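
To put numbers on what the chart shows, one option is `value_counts(normalize=True)`, which gives each class's share of the rows. This is only a small sketch on made-up data (the `toy` frame and its counts are hypothetical); on the real data the column would be `train_data["class"]`.

```python
import pandas as pd

# Made-up class column: 6 PLAIN, 3 PUNCT, 1 DATE
toy = pd.DataFrame({"class": ["PLAIN"] * 6 + ["PUNCT"] * 3 + ["DATE"]})

# Share of rows per class, as a percentage
class_share = toy["class"].value_counts(normalize=True) * 100
print(class_share.round(1))
```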

Now, let's get all the data with (before != after):

In [14]:
diff_data = train_data.loc[train_data["before"] != train_data["after"]]

group_class_change = diff_data.groupby('class',as_index=False).count()

buf = sns.barplot(x=group_class_change['class'],y=group_class_change['before'])
buf.set(xlabel='Classes', ylabel='Changed Data Count')

for x in buf.get_xticklabels():
    x.set_rotation(90)

Even though words take up most of our data, we can see that classes that are dates and letters compose most of the data that satisfies (before != after)!

However, if we take a closer look at the classes in the 2 bar charts, we notice that the class "PUNCT" for punctuation is missing in the second chart! 

This means that **none** of the punctuation data changes, which can be expected because punctuation can't be normalized.
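
Rather than reading this off the two charts, one way to check which classes never change is a set difference between all classes and the classes that have at least one changed row. The sketch below uses a made-up `toy` frame (hypothetical rows, not competition data) just to show the idea.

```python
import pandas as pd

# Made-up rows that only mimic the "class"/"before"/"after" columns
toy = pd.DataFrame({
    "class":  ["PLAIN", "PLAIN", "DATE", "PUNCT", "PUNCT"],
    "before": ["the", "colour", "2017", ".", ","],
    "after":  ["the", "color", "twenty seventeen", ".", ","],
})

all_classes = set(toy["class"])
# Classes with at least one row where "before" != "after"
changed_classes = set(toy.loc[toy["before"] != toy["after"], "class"])

# Classes that appear in the data but never change
unchanged_classes = sorted(all_classes - changed_classes)
print(unchanged_classes)
```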

Now, we still want the percentage of change, so we remove the "PUNCT" class from "group_class" so that both tables contain the same classes (there is no "PUNCT" label in our second bar chart):

In [15]:
#Drop "PUNCT" in Dataframe group_class (by label rather than by row position):
group_class = group_class[group_class["class"] != "PUNCT"]

Finally, let's get the approximate percentage of (before != after) for all the classes:

In [16]:
#Dictionary to store rows and columns to later convert to a DataFrame
change_percentage = {"class": [], "percentage": []}
for i in group_class_change["class"]:
    #Number of changed rows and total rows for this class
    changed_count = group_class_change.loc[group_class_change["class"] == i, "before"].iloc[0]
    total_count = group_class.loc[group_class["class"] == i, "before"].iloc[0]
    change_percentage["class"].append(i)
    change_percentage["percentage"].append(100 * changed_count / total_count)
#PUNCT never changes, so add it back with 0%
change_percentage["class"].append("PUNCT")
change_percentage["percentage"].append(0.0)

change_percentage = pd.DataFrame(change_percentage)
print(change_percentage)
buf = sns.barplot(x=change_percentage['class'],y=change_percentage['percentage'])
for x in buf.get_xticklabels():
    x.set_rotation(90)

From this, we can see that the classes **PLAIN** and **PUNCT** rarely change (0.49% and 0% respectively).
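
For what it's worth, the percentage calculation above can probably also be written without dropping "PUNCT" or looping, by taking the per-class mean of a boolean "changed" mask. This is a sketch under assumptions: the `toy` frame is made-up data, and because it counts every row (a missing value compares as unequal), the numbers could differ slightly from the count-based version on the real files.

```python
import pandas as pd

# Made-up rows that only mimic the "class"/"before"/"after" columns
toy = pd.DataFrame({
    "class":  ["PLAIN", "PLAIN", "PLAIN", "PLAIN", "DATE", "DATE", "PUNCT", "PUNCT"],
    "before": ["the", "colour", "a", "cat", "2017", "1 Jan", ".", ","],
    "after":  ["the", "color", "a", "cat", "twenty seventeen",
               "the first of january", ".", ","],
})

# True where "before" and "after" differ
is_changed = toy["before"] != toy["after"]

# The mean of a boolean mask per class is the fraction of changed rows
change_pct = (
    is_changed.groupby(toy["class"])
    .mean()
    .mul(100)
    .rename("percentage")
    .reset_index()
)
print(change_pct)
```

Classes that never change (PUNCT here) show up with 0% automatically, so nothing has to be appended by hand.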

**And, that concludes my crude analysis of the data.** 

We can see that in our data:
* The Plain and Punctuation classes compose most of the data
* Other than the Plain and Punctuation classes, most classes have a (before != after) percentage of over 95%, while the Verbatim class is at 33%

Hope this helps! It would be awesome if you would let me know if my code/analysis:
* Needs more analysis
* Has code that is too redundant
etc. That would help a lot for a novice like me!