## ARM and Networking

1. Introduction
2. Theory
3. Methods
4. Results
5. Conclusion
6. Reference


# Introduction

### Association rule mining (ARM) is a technique used to discover relationships among a large set of variables in a data set. It has been applied in a variety of industry settings and disciplines but has, to date, not been widely used in the social sciences, especially in education, counseling, and associated disciplines. In this part, we apply it to our data set, the tweets with the keyword "new graduate", to explore the relationships within it.

# Theory

### In data mining and machine learning, association rule mining is a common unsupervised learning technique. Unlike the classification and clustering algorithms we have learned before, its main purpose is to explore associations among the inherent structural features (i.e., variables) of the data.

### Put simply, the goal is to find meaningful and valuable relationships in large-scale data sets. On the one hand, these relationships broaden our understanding of the data and its characteristics; on the other hand, they support the construction and application of recommendation systems (for example, market basket analysis).

### With this basic understanding, we can subdivide association rules further. Taking everyday life as an example, customers shopping in a supermarket who buy bread are very likely to buy milk as well. This kind of relationship is called a simple association rule. As another example, many customers who buy car sun visors go on to buy windshield washer fluid in the near future. Such cases reflect not only a relationship between items but also a chronological order, so this kind of relationship is called a sequential association rule.
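The strength of a simple rule such as "bread implies milk" is usually summarized with three measures: support, confidence, and lift. The following is a minimal sketch of how they can be computed, assuming a small made-up list of shopping baskets; the `baskets` data and the helper names `support`, `confidence`, and `lift` are illustrative and not part of the project code.

```python
# Toy transactions (hypothetical data, not taken from the tweet data set)
baskets = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread", "butter"},
    {"eggs"},
    {"bread", "milk", "butter"},
    {"eggs", "butter"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(lhs, rhs, transactions):
    """Estimate of P(rhs | lhs): support(lhs and rhs) / support(lhs)."""
    both = support(set(lhs) | set(rhs), transactions)
    return both / support(lhs, transactions)

def lift(lhs, rhs, transactions):
    """Confidence divided by the baseline support of rhs."""
    return confidence(lhs, rhs, transactions) / support(rhs, transactions)

supp = support({"bread", "milk"}, baskets)
conf = confidence({"bread"}, {"milk"}, baskets)
lft = lift({"bread"}, {"milk"}, baskets)
print(f"support={supp:.2f} confidence={conf:.2f} lift={lft:.2f}")
```

A lift above 1 suggests that the two items appear together more often than they would if they were independent, which is why a minimum lift is a common filter for rules.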


### Here we hope to explore our last four questions through ARM. We now have many tweets about work, job search, fresh graduates and retirement. We need to summarize these tweets to get four text files to explore people's emotional attitudes from a comprehensive perspective.

# Methods

#### First, we should import some packages

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from apyori import apriori
import networkx as nx 

### The following code is needed to read, clean, and convert the tweets into a format suitable for ARM

In [2]:
# tweets with the keyword of "new graduate"
df = pd.read_csv("data/01-modified-data/textcleaning_py1.csv")
df = df[["id","favorited","retweeted","Clean_Text"]]
df = df.dropna()
tweets = df["Clean_Text"]

### Utility function: Re-format output

In [3]:
from apyori import apriori
import pandas as pd 

def reformat_results(results):
    # Flatten apyori's RelationRecord objects into one row per rule.
    # Each record holds an itemset, its support, and a list of ordered statistics;
    # each statistic holds (items_base, items_add, confidence, lift).
    keep=[]
    for record in results:
        supp=record.support
        for stat in record.ordered_statistics:
            # Skip rules with an empty antecedent
            if(len(stat.items_base)!=0):
                lhs=list(stat.items_base)   # antecedent ("if" side)
                rhs=list(stat.items_add)    # consequent ("then" side)
                conf=float(stat.confidence)
                lift=float(stat.lift)
                keep.append([lhs,rhs,supp,conf,supp*conf,lift])

    return pd.DataFrame(keep, columns =["lhs","rhs","supp","conf","supp x conf","lift"])
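To make the shape of the input to this function easier to follow, here is a minimal sketch that uses hand-built stand-ins instead of real apyori output. The two namedtuples below only mimic the layout of the records (an itemset, its support, and a list of rule statistics); the record itself and all of its numbers are made up for illustration.

```python
from collections import namedtuple

# Hypothetical stand-ins that mimic the layout of apyori's output records
OrderedStatistic = namedtuple(
    "OrderedStatistic", ["items_base", "items_add", "confidence", "lift"]
)
RelationRecord = namedtuple(
    "RelationRecord", ["items", "support", "ordered_statistics"]
)

# One made-up record for the itemset {mbbs, graduate}; numbers are illustrative only
record = RelationRecord(
    items=frozenset({"mbbs", "graduate"}),
    support=0.10,
    ordered_statistics=[
        OrderedStatistic(frozenset(), frozenset({"mbbs", "graduate"}), 0.10, 1.0),
        OrderedStatistic(frozenset({"mbbs"}), frozenset({"graduate"}), 0.90, 4.5),
    ],
)

rows = []
for stat in record.ordered_statistics:
    if len(stat.items_base) == 0:
        continue  # empty antecedent: not a usable "if ... then ..." rule
    rows.append({
        "antecedent": sorted(stat.items_base),
        "consequent": sorted(stat.items_add),
        "supp": record.support,
        "conf": stat.confidence,
        "lift": stat.lift,
    })

print(rows)
```

Only the second statistic survives the filter, so the record contributes a single rule from "mbbs" to "graduate".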


### Utility function: Convert to NetworkX object

In [4]:
def convert_to_network(df):
    print(df)

    #BUILD GRAPH
    G = nx.DiGraph()  # DIRECTED
    for _, row in df.iterrows():
        # Positional access: column 0 = antecedent items, column 1 = consequent items, column 3 = confidence
        lhs="_".join(row.iloc[0])
        rhs="_".join(row.iloc[1])
        conf=row.iloc[3]; #print(conf)
        if(lhs not in G.nodes): 
            G.add_node(lhs)
        if(rhs not in G.nodes): 
            G.add_node(rhs)

        edge=(lhs,rhs)
        if edge not in G.edges:
            G.add_edge(lhs, rhs, weight=conf)

    # print(G.nodes)
    # print(G.edges)
    return G

### Utility function: Plot NetworkX object

In [5]:
def plot_network(G):
    #SPECIFY X-Y POSITIONS FOR PLOTTING
    pos=nx.random_layout(G)

    #GENERATE PLOT
    fig, ax = plt.subplots()
    fig.set_size_inches(5, 5)

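    #assign edge widths based on the 'weight' attribute (rule confidence)
    weights_e = [G[u][v]['weight'] for u,v in G.edges()]

    #SAMPLE CMAP FOR COLORS 
    cmap=plt.get_cmap('Blues')   # plt.cm.get_cmap is no longer available in recent matplotlib releases
    colors_e = [cmap(G[u][v]['weight']/5.0) for u,v in G.edges()]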

    #PLOT
    nx.draw(
    G,
    edgecolors="black",
    edge_color=colors_e,
    node_size=2000,
    linewidths=2,
    font_size=8,
    font_color="white",
    font_weight="bold",
    width=weights_e,
    with_labels=True,
    pos=pos,
    ax=ax
    )
    ax.set(title='Color and size plotted by attribute')
    # ax.set_aspect('equal', 'box')
    # plt.colorbar(cmap)

    # fig.savefig("test.png")
    plt.show()

In [6]:
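# Use the first 500 tweets; each tweet becomes one transaction (a list of tokens).
# head() selects by position, which stays safe after dropna() leaves gaps in the index.
tweet = []
for text in tweets.head(500):
    alist = text.split()   # split on any whitespace and drop empty tokens
    tweet.append(alist)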

# Results
### The goal of this step is a network plot that shows the relationships among words in tweets with the keyword "new graduate". Based on our text processing, we could not get a very useful result, although the words shown below appear to have many hidden relationships. We reran the chunk below but did not obtain the final plot: the outcome depends on parameter selection and the run takes a lot of time, so the plot is not shown here.
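Because the outcome depends on the thresholds passed to the algorithm, it can help to see how quickly a support threshold prunes candidates before committing to a long run. The following is a minimal sketch under stated assumptions, not the Apriori algorithm itself: it counts only co-occurring word pairs in a handful of made-up transactions, and `transactions`, `pair_counts`, `kept`, and the threshold values are illustrative choices rather than the settings used in this project.

```python
from collections import Counter
from itertools import combinations

# Hypothetical tokenized tweets standing in for the real transactions
transactions = [
    ["rt", "mbbs", "graduate", "new", "world"],
    ["rt", "mbbs", "graduate", "new", "world"],
    ["check", "new", "design", "graduate"],
    ["rt", "wait", "announce", "new"],
    ["new", "graduate", "job"],
]

# Count how many transactions contain each unordered pair of distinct tokens
pair_counts = Counter()
for tokens in transactions:
    for pair in combinations(sorted(set(tokens)), 2):
        pair_counts[pair] += 1

# For each threshold, count the pairs whose support reaches it
n = len(transactions)
kept = {}
for min_support in (0.2, 0.4, 0.6, 0.8):
    kept[min_support] = sum(1 for c in pair_counts.values() if c / n >= min_support)
    print(f"min_support={min_support}: {kept[min_support]} frequent pairs")
```

The number of surviving pairs can only shrink as the threshold rises, so a quick sweep like this gives a rough sense of where the rule set becomes small enough to plot.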

In [None]:
print("\n------tweets with the keyword of new graduate------")
print("Transactions:",pd.DataFrame(tweet))
results = list(apriori(tweet, min_support=0.06, min_confidence=0.3, min_lift=3, min_length=1))     #RUN APRIORI ALGORITHM
pd_results=reformat_results(results)
print("Results\n",pd_results)
G=convert_to_network(pd_results)
plot_network(G)


------tweets with the keyword of new graduate------
Transactions:         0                1       2         3        4         5          6   \
0       rt     scjchurch_en     new      york  feature      zion  christian   
1       rt  arynewsofficial    mbbs  graduate      set       new      world   
2       rt         tedchris  number    number     feel      like      world   
3       rt         namyrmya    wait        go       to  announce        new   
4       rt     concepttvnew    mbbs  graduate    hafiz  muhammad     waleed   
..     ...              ...     ...       ...      ...       ...        ...   
495  check              new  design    recent       uc  berkeley   graduate   
496     rt  arynewsofficial    mbbs  graduate      set       new      world   
497     rt  arynewsofficial    mbbs  graduate      set       new      world   
498     rt  arynewsofficial    mbbs  graduate      set       new      world   
499     rt  arynewsofficial    mbbs  graduate      set       new      

# Conclusion

### In future work, we can focus on the connections between the words shown in the figure and explore them further. Looking at the words themselves, we can identify representative words in the context of a keyword, as well as words that are connected to many other words. This view is more intuitive than the information a word cloud gives us.
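One simple way to find words that are connected to many other words is to count how many rules each word takes part in, which corresponds to its degree in the network. The following is a minimal sketch, assuming a short made-up list of rules; the `rules` list and the `degree` counter are illustrative and are not results from this analysis.

```python
from collections import Counter

# Hypothetical rules as (antecedent, consequent) word pairs; not actual results
rules = [
    ("mbbs", "graduate"),
    ("graduate", "new"),
    ("world", "graduate"),
    ("set", "new"),
]

# Degree of a word = number of rules (edges) it takes part in
degree = Counter()
for lhs, rhs in rules:
    degree[lhs] += 1
    degree[rhs] += 1

# Words with the highest degree are candidate "hub" words
for word, d in degree.most_common(3):
    print(word, d)
```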

# Reference
### https://link.springer.com/article/10.3758/BF03193156