## Prerequisites

numpy==1.16.4  
pandas==0.25.0  
matplotlib==3.1.0  
seaborn==0.9.0

#### In this notebook you will learn the basics of the main Python libraries used for data analysis: 

    - pandas
    - numpy
    - matplotlib 
    - seaborn
    

## Imports

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns 

%matplotlib inline

## Working with arrays, Numpy 

In [2]:
a = np.array([1, 2, 3])   # Create a rank 1 array
print(type(a))            # Prints "<class 'numpy.ndarray'>"
print(a.shape)            # Prints "(3,)"
print(a[0], a[1], a[2])   # Prints "1 2 3"

<class 'numpy.ndarray'>
(3,)
1 2 3


In [3]:
a[0] = 5                  # Change an element of the array
print(a)                  # Prints "[5 2 3]"

[5 2 3]


In [4]:
b = np.array([[1,2,3],[4,5,6]])    # Create a rank 2 array
print(b.shape)                     # Prints "(2, 3)"
print(b[0, 0], b[0, 1], b[1, 0])   # Prints "1 2 4"

(2, 3)
1 2 4


In [5]:
a = np.zeros((2,2))   # Create an array of all zeros
print(a)              # Prints "[[0. 0.]
                      #          [0. 0.]]"

b = np.ones((1,2))    # Create an array of all ones
print(b)              # Prints "[[1. 1.]]"

c = np.full((2,2), 7)  # Create a constant array (integer dtype, because 7 is an int)
print(c)               # Prints "[[7 7]
                       #          [7 7]]"

d = np.eye(2)         # Create a 2x2 identity matrix
print(d)              # Prints "[[1. 0.]
                      #          [0. 1.]]"

e = np.random.random((2,2))  # Create an array filled with random values
print(e)                     # Might print "[[ 0.91940167  0.08143941]
                             #               [ 0.68744134  0.87236687]]"

[[0. 0.]
 [0. 0.]]
[[1. 1.]]
[[7 7]
 [7 7]]
[[1. 0.]
 [0. 1.]]
[[0.00135321 0.07650768]
 [0.17242192 0.38314889]]


In [6]:
# Create the following rank 2 array with shape (3, 4)
# [[ 1  2  3  4]
#  [ 5  6  7  8]
#  [ 9 10 11 12]]
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])

# Use slicing to pull out the subarray consisting of the first 2 rows
# and columns 1 and 2; b is the following array of shape (2, 2):
# [[2 3]
#  [6 7]]
b = a[:2, 1:3]

In [7]:
# A slice of an array is a view into the same data, so modifying it
# will modify the original array.
print(a[0, 1])   # Prints "2"
b[0, 0] = 77     # b[0, 0] is the same piece of data as a[0, 1]
print(a[0, 1])   # Prints "77"

2
77
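
To contrast the view behaviour shown above with an independent copy, here is a minimal sketch. The variable names `view` and `copy` are chosen for illustration only; `.copy()` and `np.shares_memory` are standard NumPy calls.

```python
import numpy as np

a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

view = a[:2, 1:3]         # Slicing returns a view onto the data of a
copy = a[:2, 1:3].copy()  # .copy() allocates independent data

copy[0, 0] = 77           # Changes only the copy
print(a[0, 1])            # Prints "2"

view[0, 0] = 99           # Changes a as well
print(a[0, 1])            # Prints "99"

# Check which arrays share memory with a
print(np.shares_memory(a, view), np.shares_memory(a, copy))  # Prints "True False"
```

Use `.copy()` whenever you want to modify a subarray without touching the original.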


### Math 

#### Elementwise operations

In [8]:
x = np.array([[1,2],[3,4]], dtype=np.float64)
y = np.array([[5,6],[7,8]], dtype=np.float64)

In [9]:
# Elementwise sum; both produce the array
# [[ 6.0  8.0]
#  [10.0 12.0]]
print(x + y)
print(np.add(x, y))

[[ 6.  8.]
 [10. 12.]]
[[ 6.  8.]
 [10. 12.]]


In [10]:
# Elementwise difference; both produce the array
# [[-4.0 -4.0]
#  [-4.0 -4.0]]
print(x - y)
print(np.subtract(x, y))

[[-4. -4.]
 [-4. -4.]]
[[-4. -4.]
 [-4. -4.]]


In [11]:
# Elementwise product; both produce the array
# [[ 5.0 12.0]
#  [21.0 32.0]]
print(x * y)
print(np.multiply(x, y))

[[ 5. 12.]
 [21. 32.]]
[[ 5. 12.]
 [21. 32.]]


In [12]:
# Elementwise division; both produce the array
# [[ 0.2         0.33333333]
#  [ 0.42857143  0.5       ]]
print(x / y)
print(np.divide(x, y))

[[0.2        0.33333333]
 [0.42857143 0.5       ]]
[[0.2        0.33333333]
 [0.42857143 0.5       ]]


In [13]:
# Elementwise square root; produces the array
# [[ 1.          1.41421356]
#  [ 1.73205081  2.        ]]
print(np.sqrt(x))

[[1.         1.41421356]
 [1.73205081 2.        ]]


#### Vectorized operations

In [14]:
x = np.array([[1,2],[3,4]])
y = np.array([[5,6],[7,8]])

v = np.array([9,10])
w = np.array([11, 12])

In [15]:
# Inner product of vectors; both produce 219
print(v.dot(w))
print(np.dot(v, w))

219
219


In [16]:
# Matrix / vector product; both produce the rank 1 array [29 67]
print(x.dot(v))
print(np.dot(x, v))

[29 67]
[29 67]


In [17]:
# Matrix / matrix product; both produce the rank 2 array
# [[19 22]
#  [43 50]]
print(x.dot(y))
print(np.dot(x, y))

[[19 22]
 [43 50]]
[[19 22]
 [43 50]]
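
As a side note, the same three products can also be written with the `@` operator (Python 3.5+), which many people find easier to read than `dot`. A short sketch reusing the arrays from the cells above:

```python
import numpy as np

x = np.array([[1, 2], [3, 4]])
y = np.array([[5, 6], [7, 8]])
v = np.array([9, 10])
w = np.array([11, 12])

print(v @ w)   # Inner product of vectors; prints "219"
print(x @ v)   # Matrix / vector product; prints "[29 67]"
print(x @ y)   # Matrix / matrix product; prints "[[19 22]
               #                                  [43 50]]"
```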


In [18]:
x = np.array([[1,2],[3,4]])

print(np.sum(x))  # Compute sum of all elements; prints "10"
print(np.sum(x, axis=0))  # Compute sum of each column; prints "[4 6]"
print(np.sum(x, axis=1))  # Compute sum of each row; prints "[3 7]"

10
[4 6]
[3 7]


## Load data

Download the .zip archive from the link below, unzip it, and place `train_preprocessed.csv` in the same folder as this notebook:  
    https://www.kaggle.com/fizzbuzz/cleaned-toxic-comments

In [20]:
# Load data 
df = pd.read_csv("train_preprocessed.csv")


In [None]:
# Explore a few lines from the table
df.head(10)

In [None]:
# Same as the previous cell, but from the end of the table; the default number of rows is 5
df.tail() 

#### Documentation   

Refer to the documentation from this link to find some information about working with pandas DataFrames:  
    https://pandas.pydata.org/pandas-docs/version/0.25.0/reference/api/pandas.DataFrame.html 
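
As a quick orientation before the exercises, here is a sketch of a few commonly used DataFrame attributes. The `toy` frame and its values are invented for illustration and are not the Kaggle data; try the same ideas on `df` yourself.

```python
import pandas as pd

# Toy data, made up for this example only
toy = pd.DataFrame({
    "id": ["a1", "b2", "c3"],
    "comment_text": ["hello there", "nice work", "go away"],
    "toxic": [0, 0, 1],
})

print(list(toy.columns))   # Column names; prints "['id', 'comment_text', 'toxic']"
print(toy.shape)           # (rows, columns); prints "(3, 3)"
print(toy["toxic"].sum())  # Number of rows with toxic == 1; prints "1"
```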

Show which columns are available in this DataFrame (for example: 'comment_text', 'id', ...):

In [None]:
#### Your code here 

Show the DataFrame's shape (rows, columns):

In [None]:
#### Your code here 

Count how many comments are labelled as:

 1. Identity hate message 
 2. Insult message
 3. Obscene message  
 4. Severe toxic message
 5. Threat message
 6. Toxic message

In [None]:
#### Your code here

## Pre-process dataset

You can shrink the DataFrame to make it easier to work with: 

In [None]:
df_sample = df.sample(n=1000) # random selection 
df_small = df[:100] # select the first 100 rows
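
Note that `df.sample` returns different rows on every run. If you want the same sample each time, one option is to pass `random_state`. A minimal sketch on a made-up frame (`toy` and the seed value 42 are arbitrary choices for illustration):

```python
import pandas as pd

toy = pd.DataFrame({"value": range(10)})

# The same seed gives the same random selection
s1 = toy.sample(n=3, random_state=42)
s2 = toy.sample(n=3, random_state=42)
print(s1.equals(s2))            # Prints "True"

# Slicing with [:n] takes the first n rows in their original order
first = toy[:3]
print(first["value"].tolist())  # Prints "[0, 1, 2]"
```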

In [None]:
# Check the data type
df_sample.dtypes

In [None]:
# Check for duplicated rows and delete them if there are any
duplicate_rows_df = df[df.duplicated()]

print("number of duplicate rows:", duplicate_rows_df.shape[0])

df = df.drop_duplicates()
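
To see how `duplicated` and `drop_duplicates` behave, here is a small sketch on invented data (the `toy` frame is not part of the dataset):

```python
import pandas as pd

# Toy data: the second row repeats the first
toy = pd.DataFrame({
    "comment_text": ["hello", "hello", "bye"],
    "toxic": [0, 0, 1],
})

mask = toy.duplicated()   # True for rows that repeat an earlier row
print(mask.tolist())      # Prints "[False, True, False]"
print(mask.sum())         # Number of duplicate rows; prints "1"

# Keep the first occurrence of each row and renumber the index
deduped = toy.drop_duplicates().reset_index(drop=True)
print(deduped.shape)      # Prints "(2, 2)"
```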

In [None]:
# Drop unnecessary columns:

df.drop(columns='set', inplace=True)

## Visualizations

### Histogram plot 

In [None]:
# Count label occurrences

labels = df[['identity_hate', 'insult', 'obscene', 'severe_toxic', 'threat', 'toxic']].sum()

In [None]:
labels

In [None]:
plt.figure(figsize = (8, 4))
# Keyword arguments x= and y= also work in newer seaborn versions
ax = sns.barplot(x = labels.index, y = labels.values, alpha = 0.8)
plt.title("# per class")
plt.ylabel('# of occurrences', fontsize = 12)
plt.xlabel('Type', fontsize = 12)

# Add text labels (a separate name keeps the `labels` Series intact on re-runs)
rects = ax.patches
counts = labels.values
for rect, count in zip(rects, counts):
    height = rect.get_height()
    ax.text(rect.get_x() + rect.get_width() / 2, height + 5, count, ha = 'center', va = 'bottom')

plt.show()

Comment on what you see here: 

In [None]:
#### Your text here

### See how labels correlate with each other: 

In [None]:
temp_df = df[['identity_hate', 'insult', 'obscene', 'severe_toxic', 'threat', 'toxic']]

corr = temp_df.corr()
plt.figure(figsize = (10, 8))
sns.heatmap(corr,
            xticklabels = corr.columns.values,
            yticklabels = corr.columns.values, annot=True)

The plot above indicates a pattern of co-occurrence. 

Comment on what you see in the plot above:

In [None]:
#### Your text here
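
If it is unclear what the numbers in the correlation heatmap measure, the following sketch may help. It uses invented 0/1 labels (the `toy` frame is not the dataset) and shows that `DataFrame.corr()` computes the Pearson correlation by default, which for binary columns is positive when two labels tend to appear together.

```python
import numpy as np
import pandas as pd

# Toy labels: insult and obscene often co-occur, insult and threat never do
toy = pd.DataFrame({
    "insult":  [1, 1, 0, 0, 1, 0],
    "obscene": [1, 1, 0, 0, 0, 0],
    "threat":  [0, 0, 1, 0, 0, 1],
})

corr = toy.corr()  # Pearson correlation between every pair of columns

# The same value computed directly with NumPy
manual = np.corrcoef(toy["insult"], toy["obscene"])[0, 1]
print(np.isclose(corr.loc["insult", "obscene"], manual))  # Prints "True"

# Raw co-occurrence count: rows where both labels are set
both = ((toy["insult"] == 1) & (toy["obscene"] == 1)).sum()
print(both)  # Prints "2"
```

Here the insult/obscene entry is positive and the insult/threat entry is negative, which matches how often the labels appear in the same row.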

## Wordclouds

Calculate the number of unique words across all of the comments. (Tip: to split text into words, use `text.split()`, which separates the text on whitespace.)

In [None]:
### Your code here
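
As a hint, here is one possible approach sketched on three made-up comments (the `comments` Series is invented; on the real data you would use the comment column of `df`). It collects words into a `set`, which keeps each word only once.

```python
import pandas as pd

# Toy comments, made up for this example only
comments = pd.Series(["you are great", "you are not", "great great work"])

vocabulary = set()
for text in comments:
    # split() with no argument splits on any run of whitespace
    vocabulary.update(text.split())

print(len(vocabulary))     # Prints "5"
print(sorted(vocabulary))  # Prints "['are', 'great', 'not', 'work', 'you']"
```

Note that this simple split treats "Great" and "great" as different words, so you may want to lowercase the text first.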

Let's work with wordclouds. 
The next task would be to select all of the words from the textual data and create a wordcloud. 
Here you can see an example of such visualisation: 

https://towardsdatascience.com/word-clouds-in-python-comprehensive-example-8aee4343c0bf 

Create the same visualization for our dataset. 
Describe what you see.

In [None]:
### Your code for the wordcloud here

## Distributions

The main goal of this task is to plot word distributions for each category. 
What this means: 

1. Select the words from each category (identity_hate, insult, etc.) 
2. Plot a histogram of the most popular words for each category 
3. Try removing stop words: 
    1. Install the nltk library
    2. `from nltk.corpus import stopwords` gives you access to the most common stop words 
    3. Filter them out of the words for each category
4. Plot the histogram again. Has it changed? 
5. Analyse the results. 
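
The counting part of the steps above can be sketched as follows. This is only an illustration on invented data: the `toy` frame, the helper `top_words`, and the tiny `STOP_WORDS` set are all hypothetical, and the set is a stand-in for the much longer nltk list.

```python
from collections import Counter

import pandas as pd

# Toy data, made up for this example only
toy = pd.DataFrame({
    "comment_text": ["you are an idiot", "you are the best", "the idiot is back"],
    "insult": [1, 0, 1],
})

# Hypothetical stand-in for nltk's stopwords.words("english")
STOP_WORDS = {"you", "are", "an", "the", "is"}


def top_words(frame, label, n=3, stop_words=frozenset()):
    """Return the n most common words in comments where `label` equals 1."""
    texts = frame.loc[frame[label] == 1, "comment_text"]
    counts = Counter(
        word
        for text in texts
        for word in text.lower().split()
        if word not in stop_words
    )
    return counts.most_common(n)


print(top_words(toy, "insult"))  # Stop words are still included
print(top_words(toy, "insult", stop_words=STOP_WORDS))  # Prints "[('idiot', 2), ('back', 1)]"
```

The returned (word, count) pairs can then be passed to a bar plot. With nltk, the real list is available via `stopwords.words("english")` after running `nltk.download("stopwords")` once.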


In [None]:
#### Your code here 

In [None]:
#### Your explanation here

## Conclusions

Please write down what you learned and found during this task.  
What was the most difficult part?  
What did you enjoy?  
Suggest your improvements.  

In [None]:
#### Your text here