# Data Analysis Examples

[Jian Tao](https://coehpc.engr.tamu.edu/people/jian-tao/), Texas A&M University

Aug 10, 2019

Converted from the **pandas workshop notebook** by [Luciano Strika](https://github.com/StrikingLoo/pandas_workshop)

## 1. Data exploration with pandas

In [None]:
%matplotlib inline
import pandas as pd
import random

### First, we will generate a random data set

In [None]:
names = ["Albert","John","Richard","Henry","William"]
surnames = ["Goodman","Black","White","Green","Joneson"]
gender = ["F", "M"]
# Ten candidate salaries, each a multiple of 500 between 5000 and 15000.
salaries = [500*random.randint(10,30) for _ in range(10)]

In [None]:
def generate_random_person(names, surnames, gender, salaries):
    # Pick one value at random from each list.
    return {"name": random.choice(names),
            "surname": random.choice(surnames),
            "gender": random.choice(gender),
            "salary": random.choice(salaries)}

def generate_people(k):
    # Build k random records from the lists defined above.
    return [generate_random_person(names, surnames, gender, salaries) for _ in range(k)]
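
Because the records are drawn at random, every run of the notebook produces a different table. The following is a minimal sketch of how the draws could be made repeatable, assuming reproducibility is wanted. The helper `sample_names` and the seed value 42 are illustrative and not part of the original notebook.

```python
import random

names = ["Albert", "John", "Richard", "Henry", "William"]

def sample_names(seed, k=5):
    # Hypothetical helper: a private Random instance seeded explicitly,
    # so the global random state is left untouched.
    rng = random.Random(seed)
    return [rng.choice(names) for _ in range(k)]

first = sample_names(42)
second = sample_names(42)
print(first == second)  # the same seed yields the same sequence
```

Calling `random.seed(...)` once at the top of the notebook would have a similar effect on the global generator.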

In [None]:
generate_random_person(names, surnames, gender, salaries)

In [None]:
df = pd.DataFrame(generate_people(50),columns=["name","surname", "gender", "salary"])

In [None]:
df.to_csv("random_people.csv", index=False)

### Now we are going to use the data

In [None]:
df = pd.read_csv("random_people.csv")

In [None]:
# Start getting a feel for the data.
df.head(5)

In [None]:
df['salary'].value_counts()

In [None]:
df['salary'].median()
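
`value_counts()` and `median()` summarize the whole column at once. As a sketch of the same idea applied per group, the example below computes the count and median salary for each gender. The small `demo` frame is made-up data that only mirrors the column names used above.

```python
import pandas as pd

# Made-up rows with the same column names as the generated data set.
demo = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F"],
    "salary": [5000, 7000, 9000, 11000, 13000],
})

# One row per gender, with the number of people and the median salary.
summary = demo.groupby("gender")["salary"].agg(["count", "median"])
print(summary)
```

The same call should work on `df` itself once it has been loaded from `random_people.csv`.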

In [None]:
def tax(s):
    # Salaries of 6000 or more keep 70% after tax; lower salaries keep 85%.
    if s>=6000:
        return s*.7
    else:
        return s*.85

In [None]:
df["salary_after_tax"] = df["salary"].apply(tax)
df.head(5)
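
`apply` calls the Python function once per row, which is fine for 50 rows. For larger tables the same rule can usually be expressed on whole columns. The sketch below compares the two approaches on a made-up three-row frame, using `numpy.where` as one possible vectorized alternative; the names `demo`, `via_apply` and `via_where` are illustrative.

```python
import numpy as np
import pandas as pd

def tax(s):
    # Same rule as above: 70% kept at 6000 or more, 85% kept otherwise.
    return s * 0.7 if s >= 6000 else s * 0.85

# Made-up salaries covering both branches and the 6000 boundary.
demo = pd.DataFrame({"salary": [5000, 6000, 15000]})

# Row-by-row version, as in the notebook.
demo["via_apply"] = demo["salary"].apply(tax)

# Column-wise version: choose between two precomputed columns.
demo["via_where"] = np.where(demo["salary"] >= 6000,
                             demo["salary"] * 0.7,
                             demo["salary"] * 0.85)

print(np.allclose(demo["via_apply"], demo["via_where"]))
```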

In [None]:
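# Keep only the rows whose after-tax salary exceeds 10000.
# With the salaries above this needs a pre-tax salary of 14500 or 15000,
# so the result may be empty for some random draws.
df_high = df[df["salary_after_tax"]>10000]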

In [None]:
df_high

In [None]:
df.corr(numeric_only=True)

In [None]:
df.info()

## 2. Wordcloud of "The Complete Works of William Shakespeare"

Project Gutenberg’s [The Complete Works of William Shakespeare, by William
Shakespeare](https://www.gutenberg.org/files/100/100-0.txt)

This eBook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever.  You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this eBook or online at
www.gutenberg.org.  If you are not located in the United States, you’ll
have to check the laws of the country where you are located before using
this ebook.

In [None]:
import matplotlib.pyplot as plt
import nltk
from wordcloud import WordCloud, STOPWORDS

In [None]:
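# Read a local copy of the text; the UTF-8 encoding is an assumption about the saved file.
with open("Shakespeare.txt", "r", encoding="utf-8") as f:
    text = f.read()
# Alternatively, download the text directly:
#import requests
#text = requests.get('https://www.gutenberg.org/files/100/100-0.txt').text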

In [None]:
','.join(STOPWORDS)

In [None]:
# Strings are immutable, so the cleaned text must be assigned back.
text = text.strip().replace("\n", " ").replace("\r", " ")
print("There are {} characters in the collection.".format(len(text)))

In [None]:
wordcloud = WordCloud(width=1600, height=800, stopwords=STOPWORDS, background_color='black').generate(text)
plt.figure(figsize=(20,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.tight_layout(pad=0)
plt.show()
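
A word cloud sizes each word roughly by how often it occurs once the stop words are dropped. The sketch below shows that counting step with the standard library only, on one line of made-up input. The tiny `stopwords` set is a stand-in for `STOPWORDS`, and the tokenizing regular expression is an assumption rather than the one the `wordcloud` package uses.

```python
import re
from collections import Counter

sample = "To be, or not to be, that is the question"

# Hypothetical, very small stop list standing in for wordcloud.STOPWORDS.
stopwords = {"to", "or", "not", "that", "is", "the"}

# Lowercase the text and split it into runs of letters and apostrophes.
words = re.findall(r"[a-z']+", sample.lower())

# Count every word that is not a stop word.
counts = Counter(w for w in words if w not in stopwords)
print(counts.most_common(2))
```

Running the same counting on `text` gives a quick numeric check of which words should dominate the image.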

In [None]:
# Uncomment to save the word cloud to a file.
#wordcloud.to_file("wordcloud.png")