Check relationships between the target variable and numeric features via a **seaborn heatmap** plot.
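A minimal sketch of that check, assuming a small synthetic DataFrame. The column names `feat_a`, `feat_b`, and `target` are hypothetical and do not come from these notes; substitute your own data.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data with hypothetical column names (not from the original notes)
rng = np.random.default_rng(0)
feat_a = rng.normal(size=200)
feat_b = rng.normal(size=200)
demo = pd.DataFrame({
    "feat_a": feat_a,
    "feat_b": feat_b,
    "target": 2.0 * feat_a + rng.normal(scale=0.5, size=200),
})

# Pearson correlation between every pair of numeric columns
corr_matrix = demo.select_dtypes("number").corr()

# Draw the matrix as an annotated heatmap on a fixed [-1, 1] color scale
fig, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap="coolwarm",
            vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation matrix")

# Rank the features by absolute correlation with the target
target_corr = (corr_matrix["target"].drop("target")
               .abs().sort_values(ascending=False))
print(target_corr)
```

Reading the `target` row (or the ranked `target_corr` series) shows which numeric features move most closely with the target. Correlation only captures linear relationships, so a low value does not rule out a nonlinear one.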

**Graphics** 
- Contingency tables
- Scatter plot
- Bar chart
- Biplot
- Box plot
- Control chart
- Correlogram
- Fan chart
- Forest plot
- Histogram
- Pie chart
- Q–Q plot
- Run chart
- Stem-and-leaf display
- Radar chart

### Non-hierarchical
- Histogram
- Pie
- Stack
- Chord
- Force

### Hierarchical
- Tree
- Cluster
- Tree map
- Partition
- Pack

**T-Distribution** - Visualize high density

## Basic Summary

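In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# `df` is assumed to be already loaded, with the columns
# year, life_exp, log_pop and log_gdp_per_cap.

# NOTE: `corr` is called below but was never defined in these notes.
# This is an assumed minimal version that annotates a panel with the
# Pearson correlation coefficient of the two variables.
def corr(x, y, **kwargs):
    coef = np.corrcoef(x, y)[0][1]
    ax = plt.gca()
    ax.annotate(r'$\rho$ = {:.2f}'.format(coef),
                xy = (0.2, 0.95), size = 20, xycoords = ax.transAxes)

# Define a summary function for the diagonal panels
def summary(x, **kwargs):
    # Convert to a pandas Series (keeps the column name)
    x = pd.Series(x)

    # Get stats for the series
    label = x.describe()[['mean', 'std', 'min', '50%', 'max']]

    # Convert from log10 to regular scale
    # Adjust the column names for presentation
    if label.name == 'log_pop':
        label = 10 ** label
        label.name = 'pop stats'
    elif label.name == 'log_gdp_per_cap':
        label = 10 ** label
        label.name = 'gdp_per_cap stats'
    else:
        label.name = 'life_exp stats'

    # Round the labels for presentation
    label = label.round()
    ax = plt.gca()
    ax.set_axis_off()

    # Add the labels to the plot (annotate expects a string)
    ax.annotate(pd.DataFrame(label).to_string(),
                xy = (0.1, 0.2), size = 20, xycoords = ax.transAxes)


# Create a pair grid instance (`height` replaces the removed `size` argument)
grid = sns.PairGrid(data = df[df['year'] == 2007],
                    vars = ['life_exp', 'log_pop', 'log_gdp_per_cap'], height = 4)

# Fill in the mappings
grid = grid.map_upper(plt.scatter, color = 'darkred')
grid = grid.map_upper(corr)
grid = grid.map_lower(sns.kdeplot, cmap = 'Reds')
grid = grid.map_diag(summary);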

Below is the code to plot the univariate distribution of each numerical column, showing the histogram together with the estimated PDF. We use seaborn's histplot with kde=True (the older distplot function is deprecated):

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

col_names = ['StrengthFactor','PriceReg', 'ReleaseYear', 'ItemCount', 'LowUserPrice', 'LowNetPrice']

# One subplot per column, stacked vertically
fig, ax = plt.subplots(len(col_names), figsize=(16,12))

for i, col_val in enumerate(col_names):
    # Histogram of counts with a kernel density estimate overlaid
    sns.histplot(sales_data_hist[col_val], kde=True, ax=ax[i])
    ax[i].set_title('Freq dist '+col_val, fontsize=10)
    ax[i].set_xlabel(col_val, fontsize=8)
    ax[i].set_ylabel('Count', fontsize=8)

plt.show()

# Visualizations

## Graphical Integrity

https://www.amazon.com/Visual-Display-Quantitative-Information/dp/1930824130

Tufte's *The Visual Display of Quantitative Information* gives six principles to ensure graphical integrity:

- Make the representation of numbers proportional to quantities
- Use clear, detailed, and thorough labeling
- Show data variation, not design variation
- Use standardized units, not nominal values
- Depict 'n' data dimensions with no more than 'n' variable dimensions
- Quote data in full context

## Misc Visualization Notes

https://realpython.com/python-data-visualization-bokeh

### Histograms and Density Plots

Histograms work very well for displaying a single variable from one category (in this case, the one category was all the flights). For displaying multiple categories, however, overlaid histograms do not work well because the plots obscure one another.

- Solution 1: Side-by-side histograms
- Solution 2: Stacked histograms
- Solution 3: Density plots
- Density with rug plot

| Method | Description |
| --- | --- |
| DataFrame.plot([x, y, kind, ax, …]) | DataFrame plotting accessor and method |
| DataFrame.plot.area([x, y]) | Draw a stacked area plot. |
| DataFrame.plot.bar([x, y]) | Vertical bar plot. |
| DataFrame.plot.barh([x, y]) | Make a horizontal bar plot. |
| DataFrame.plot.box([by]) | Make a box plot of the DataFrame columns. |
| DataFrame.plot.density([bw_method, ind]) | Generate Kernel Density Estimate plot using Gaussian kernels. |
| DataFrame.plot.hexbin(x, y[, C, …]) | Generate a hexagonal binning plot. |
| DataFrame.plot.hist([by, bins]) | Draw one histogram of the DataFrame's columns. |
| DataFrame.plot.kde([bw_method, ind]) | Generate Kernel Density Estimate plot using Gaussian kernels. |
| DataFrame.plot.line([x, y]) | Plot DataFrame columns as lines. |
| DataFrame.plot.pie([y]) | Generate a pie plot. |
| DataFrame.plot.scatter(x, y[, s, c]) | Create a scatter plot with varying marker point size and color. |
| DataFrame.boxplot([column, by, ax, …]) | Make a box plot from DataFrame columns. |
| DataFrame.hist([column, by, grid, …]) | Make a histogram of the DataFrame's columns. |

### Create a Pairplot Using Scatter, Histogram, and Density Plots

**Default Pair Plot with All Data**
Let's use the entire dataset and sns.pairplot to create a simple yet useful plot.


**Group and Color by a Variable**
To better understand the data, we can color the pair plot by a categorical variable using the hue keyword. First, we color the plots by continent.


**Customizing pairplot**
First, let's change the diagonal from a histogram to a KDE, which better shows the differences between continents. We can also adjust the alpha (transparency) of the scatter plots so that overlapping points stay visible, and change the size of the markers. Finally, we increase the size of each panel to better show the data.

sns.pairplot(df, hue = 'continent', diag_kind = 'kde', plot_kws = {'alpha': 0.6, 's': 80, 'edgecolor': 'k'}, height = 4);


Density plots on the diagonal make comparisons easier when the data fall into multiple categories. We can color the plot by any variable we like. For example, here is a plot colored by a decade categorical variable we create from the year column.


# Bin edges run through 2010 so that years after 2000 (e.g. 2007) also get a decade
df['decade'] = pd.cut(df['year'], bins = range(1950, 2011, 10))

sns.pairplot(df, hue = 'decade', diag_kind = 'kde', vars = ['life_exp', 'log_pop', 'log_gdp_per_cap'],
             plot_kws = {'alpha': 0.6, 's': 80, 'edgecolor': 'k'}, height = 4);

### Box Plot

Below is the code to draw a box plot for each column listed in col_names. The box plot lets us visually identify outliers in the dataset.

The key terminology to note here is as follows:

- The range of the data is a measure of spread, equal to the difference between the largest data point (max) and the smallest one (min).
- The interquartile range (IQR) is the range covered by the middle 50% of the data.
- IQR = Q3 - Q1, the difference between the third and first quartiles. The first quartile (Q1) is the value such that one quarter (25%) of the data points fall below it, or the median of the bottom half of the data. The third quartile (Q3) is the value such that three quarters (75%) of the data points fall below it, or the median of the top half of the data.
- The IQR can be used to detect outliers with the 1.5 × IQR criterion: outliers are observations that fall below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR.

Based on this definition of outliers, the black dots are outliers in the StrengthFactor attribute, and the red box spans the IQR.
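The 1.5 × IQR rule can also be checked numerically. The helper below is a hypothetical sketch (`iqr_outliers` is not part of pandas or of these notes). It uses pandas' default linear-interpolation quantiles, which can differ slightly from the "median of each half" definition on small samples.

```python
import pandas as pd

def iqr_outliers(values, k=1.5):
    """Return (lower_fence, upper_fence, outliers) using the k * IQR rule."""
    s = pd.Series(values, dtype="float64")
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Keep only the observations outside the fences
    outliers = s[(s < lower) | (s > upper)]
    return lower, upper, outliers

sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
lower, upper, outliers = iqr_outliers(sample)
print(lower, upper)       # -3.5 14.5
print(outliers.tolist())  # [100.0]
```

Here Q1 = 3.25 and Q3 = 7.75, so IQR = 4.5 and the fences sit at -3.5 and 14.5. Only the value 100 falls outside them, and it would be drawn as a separate point on the box plot.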

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Check whether the dataset contains any null values (returns True or False)
data_frame.isnull().values.any()

# Count the missing values in each column
data_frame.isnull().sum()

# Replace the missing values in one column with 0
# (assignment avoids the deprecated chained inplace fillna)
data_frame['col_name'] = data_frame['col_name'].fillna(0)

col_names = ['StrengthFactor','PriceReg', 'ReleaseYear', 'ItemCount', 'LowUserPrice', 'LowNetPrice']

# One box plot per column, stacked vertically
fig, ax = plt.subplots(len(col_names), figsize=(8,40))

for i, col_val in enumerate(col_names):

    sns.boxplot(y=sales_data_hist[col_val], ax=ax[i])
    ax[i].set_title('Box plot - {}'.format(col_val), fontsize=10)
    ax[i].set_xlabel(col_val, fontsize=8)

plt.show()

## Word Cloud

In [None]:
import os

import matplotlib.pyplot as plt
import numpy as np
from PIL import Image
from wordcloud import WordCloud

def word_cloud(tweets):
    # `tweets` is a single string containing all of the tweet text

    # Get the directory we are working in
    work_dir = os.getcwd()
    # Read the mask image into a numpy array
    avengers_mask = np.array(Image.open(os.path.join(work_dir, "avengers.png")))
    # Generate the word cloud from the text, shaped by the mask
    # (when a mask is given, its shape is used instead of width/height)
    cloud = WordCloud(width=2000, height=1000, max_font_size=200, background_color="black",
                      max_words=2000, mask=avengers_mask, contour_width=1,
                      contour_color="steelblue", colormap="nipy_spectral", stopwords={"avengers"})
    cloud.generate(tweets)

    # Plot the word cloud
    plt.figure(figsize=(10,10))
    plt.imshow(cloud, interpolation="hermite")
    plt.axis("off")
    plt.show()