<a href="https://www.kaggle.com/code/mikedelong/course-catalog-as-graph?scriptVersionId=156880089" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [1]:
import pandas as pd
usecols = ['index', 'Year', 'GenArea', 'Area', 'Field', 'Name', 'Description']
df = pd.read_csv(filepath_or_buffer='/kaggle/input/california-university-fields-of-study-distributi/Courses_Berkeley_2018-01-15.csv',
                 index_col=['index'], usecols=usecols)
df = df[usecols[1:]]
df.head()

index,Year,GenArea,Area,Field,Name,Description
0,1967,Engineering,Aeronautical Engineering,Aerospace Studies,Officer Basic Military Training,(6) Study of world military systems and basic...
1,2011,Professional,Health Sciences,Public Health,"Drugs, Health, and Society",(2) Two hours of lecture and one hour of disc...
2,2011,Professional,Health Sciences,Public Health,"Policy, Planning , and Evaluation of Health Pr...",Three hours of lecture/discussion per week. T...
3,2011,Professional,Health Sciences,Public Health,Cognitive Science C1 Molecularand Cell Biology...,The course will survey the field of the human...
4,1971,Engineering,Aeronautical Engineering,Aerospace Studies,Officer Advanced Military Training,(3) Four weeks advanced officer training cond...


Let's load only the columns we're going to use, and let's put them in order from most general to most specific.
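Ordering the columns from general to specific suggests a hierarchy, but nothing so far confirms that each value has exactly one parent. A minimal sketch of one way to check that, using a made-up frame (`toy`) and a hypothetical helper name (`parents_per_child`); swap in `df` and any adjacent pair of columns to run it on the real data:

```python
import pandas as pd


def parents_per_child(frame: pd.DataFrame, parent: str, child: str) -> pd.Series:
    """Count the distinct parent values seen for each child value."""
    return frame.groupby(child)[parent].nunique()


# Made-up rows for illustration only; 'Biology' is deliberately given two parents.
toy = pd.DataFrame({
    'GenArea': ['Science', 'Science', 'Arts'],
    'Area': ['Physics', 'Biology', 'Biology'],
})
counts = parents_per_child(toy, parent='GenArea', child='Area')
# Child values that appear under more than one parent
ambiguous = counts[counts > 1].index.tolist()
print(ambiguous)
```

An empty list for every adjacent pair of columns would mean the columns form a strict tree; anything else means some nodes in a graph built from these columns will have more than one parent.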

In [2]:
from plotly.express import histogram
histogram(data_frame=df, x='Year')

The number of courses per year grows steadily over time, though not monotonically.
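The histogram shows this by eye; the same observation can be checked numerically. The sketch below is an illustration on made-up data (`toy`), with a hypothetical helper name (`yearly_counts`); replace `toy` with `df` to apply it to the course catalog.

```python
import pandas as pd


def yearly_counts(frame: pd.DataFrame) -> pd.Series:
    """Return the number of rows per year, ordered by year."""
    return frame['Year'].value_counts().sort_index()


# Made-up data for illustration only
toy = pd.DataFrame({'Year': [1900, 1900, 1901, 1902, 1902, 1902, 1903, 1903]})
counts = yearly_counts(toy)
# True only if no year has fewer rows than the year before it
is_monotonic = bool(counts.is_monotonic_increasing)
# Years whose count dropped relative to the previous year in the data
drops = counts[counts.diff() < 0]
print(is_monotonic, drops.index.tolist())
```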

In [3]:
df.nunique()

Year               98
GenArea             6
Area               90
Field             306
Name            60402
Description    214230
dtype: int64

In [4]:
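year_counts = df['Year'].value_counts()
# tail/head select by position, so the integer year labels cannot be mistaken for positions
print(year_counts.tail(5).to_dict(), year_counts.head(5).to_dict())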

{1903: 638, 1901: 610, 1904: 598, 1909: 585, 1900: 528} {2011: 7305, 1994: 6927, 1991: 6882, 2009: 6877, 1995: 6702}


The university was at its simplest in 1900 with 528 courses offered, and at its most complex in 2011 with 7305 courses offered.
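These figures count rows per year. Since `df.nunique()` above reports far fewer distinct `Name` values than `Description` values, a per-year count of distinct names may give a different picture from a per-year count of rows. A small sketch of the comparison on made-up data (`toy`); whether one row corresponds to one course in the real file is an assumption this notebook does not verify.

```python
import pandas as pd

# Made-up rows for illustration only; 'Latin' appears twice in 1900.
toy = pd.DataFrame({
    'Year': [1900, 1900, 1901, 1901, 1901],
    'Name': ['Latin', 'Latin', 'Latin', 'Greek', 'Algebra'],
})
rows_per_year = toy.groupby('Year').size()
names_per_year = toy.groupby('Year')['Name'].nunique()
summary = pd.DataFrame({'rows': rows_per_year, 'distinct_names': names_per_year})
print(summary)
# Years with the fewest and the most rows
print('fewest rows:', rows_per_year.idxmin(), 'most rows:', rows_per_year.idxmax())
```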

Let's build some helper functions that will build simple graphs and display them.

In [5]:
import networkx as nx

def build(input_df: pd.DataFrame) -> nx.Graph:
    """Build an undirected graph: Berkeley - GenArea - Area - Field - Name."""
    result = nx.Graph()
    result.add_node(node_for_adding='Berkeley')
    # all general areas are connected to the root
    for genarea in input_df['GenArea'].unique().tolist():
        result.add_edge(u_of_edge='Berkeley', v_of_edge=genarea)
    # build out the rest of the graph, one parent/child column pair at a time;
    # nodes are keyed by name only, so equal names at different levels share a node
    for parent, child in [('GenArea', 'Area'), ('Area', 'Field'), ('Field', 'Name')]:
        pairs = input_df[[parent, child]].drop_duplicates(ignore_index=True)
        for u, v in pairs.itertuples(index=False, name=None):
            result.add_edge(u_of_edge=u, v_of_edge=v)
    return result

print('built graph building function')

built graph building function
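
Because `build` keys nodes by name alone, a name that appears at two levels (say, a hypothetical Area and Field both called 'Physics') would collapse into a single node. The sketch below is a variation, not the function used in this notebook: the hypothetical `build_labeled` keys each node by a `(level, name)` tuple so that levels stay separate. The toy frame is made up for illustration.

```python
import networkx as nx
import pandas as pd

LEVELS = ['GenArea', 'Area', 'Field', 'Name']


def build_labeled(input_df: pd.DataFrame, root: str = 'Berkeley') -> nx.Graph:
    """Variant of build() whose nodes are (level, name) tuples."""
    result = nx.Graph()
    root_node = ('Root', root)
    result.add_node(root_node)
    # all general areas are connected to the root
    for genarea in input_df['GenArea'].unique():
        result.add_edge(root_node, ('GenArea', genarea))
    # connect each level to the next one down
    for parent, child in zip(LEVELS, LEVELS[1:]):
        pairs = input_df[[parent, child]].drop_duplicates()
        for u, v in pairs.itertuples(index=False, name=None):
            result.add_edge((parent, u), (child, v))
    return result


# Made-up rows: 'Physics' is used both as an Area and as a Field.
toy = pd.DataFrame({
    'GenArea': ['Science', 'Science'],
    'Area': ['Physics', 'Physics'],
    'Field': ['Physics', 'Astronomy'],
    'Name': ['Mechanics', 'Cosmology'],
})
G = build_labeled(toy)
print(G.number_of_nodes(), G.number_of_edges(), nx.is_tree(G))
```

With tuple nodes, any code that uses the node itself as hover text would need to convert it first, for example by taking the second element of the tuple.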


In [6]:
import networkx as nx
import plotly.graph_objects as go

COLORSCALE = ['Greys', 'YlGnBu', 'Greens', 'YlOrRd', 'Bluered', 'RdBu', 'Reds', 'Blues', 'Picnic',
              'Rainbow', 'Portland', 'Jet', 'Hot', 'Blackbody', 'Earth', 'Electric', 'Viridis']


def show(G: nx.Graph, colorscale: str, title: str) -> go.Figure:
    # todo introduce other layouts
    positions = nx.spring_layout(G)
    edge_x = []
    edge_y = []
    for edge in G.edges:
        x0, y0 = positions[edge[0]]
        x1, y1 = positions[edge[1]]
        edge_x.append(x0)
        edge_x.append(x1)
        edge_x.append(None)
        edge_y.append(y0)
        edge_y.append(y1)
        edge_y.append(None)

    edge_trace = go.Scatter(
        x=edge_x, y=edge_y,
        line=dict(width=0.5, color='#888'),
        hoverinfo='none',
        mode='lines')

    node_x = []
    node_y = []
    for node in G.nodes:
        x, y = positions[node]
        node_x.append(x)
        node_y.append(y)

    node_trace = go.Scatter(
        x=node_x, y=node_y,
        mode='markers',
        hoverinfo='text',
        marker=dict(
            showscale=True,
            colorscale=colorscale,
            reversescale=True,
            color=[],
            size=10,
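            colorbar=dict(
                thickness=15,
                xanchor='left',
                # title text is left unset for now; to show it, add text=title below
                # colorbar.title.side replaces the deprecated colorbar.titleside
                title=dict(side='right')
            ),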
            line_width=2))
    node_adjacencies = []
    node_text = []
    for node, adjacencies in enumerate(G.adjacency()):
        node_adjacencies.append(len(adjacencies[1]))
        # todo
#         node_text.append('# of connections: '+str(len(adjacencies[1])))
        node_text.append(adjacencies[0])

    node_trace.marker.color = node_adjacencies
    node_trace.text = node_text
    
    fig = go.Figure(data=[edge_trace, node_trace],
             layout=go.Layout(
#                 title='<br>Network graph made with Python',
                title_font_size=16,  # layout.title.font replaces the deprecated layout.titlefont
                showlegend=False,
                hovermode='closest',
                margin=dict(b=20,l=5,r=5,t=40),
                annotations=[ dict(
#                     text="Python code: <a href='https://plotly.com/ipython-notebooks/network-graphs/'> https://plotly.com/ipython-notebooks/network-graphs/</a>",
                    showarrow=False,
                    xref="paper", yref="paper",
                    x=0.005, y=-0.002 ) ],
                xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
                yaxis=dict(showgrid=False, zeroline=False, showticklabels=False))
                )
    return fig

print('built graph visualization function')

built graph visualization function


In [7]:
from arrow import now
time_start = now()
show(G=build(input_df=df[df['Year']==1900]), colorscale='Picnic', title='title').show()
print('1900: {}'.format(now() - time_start))

1900: 0:00:02.552166


In [8]:
# 2011 takes about a minute and a half, but it seems like forever
time_start = now()
show(G=build(input_df=df[df['Year'] == 2011]), colorscale='Portland', title='title').show()
print('2011: {}'.format(now() - time_start))

2011: 0:01:30.671510
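
The timings above cover graph construction, layout, and rendering together, so they do not say which step dominates. A minimal sketch of timing the layout step alone, assuming `spring_layout` is a meaningful share of the cost (not measured here). The helper name `timed_layout` is hypothetical, and the balanced tree is a stand-in for the course graph, not real data. `iterations` and `seed` are existing `spring_layout` parameters: fewer iterations trade layout quality for speed, and a fixed seed makes the positions repeatable.

```python
import time

import networkx as nx


def timed_layout(G: nx.Graph, iterations: int = 50, seed: int = 0):
    """Run spring_layout and return (positions, elapsed seconds)."""
    start = time.perf_counter()
    positions = nx.spring_layout(G, iterations=iterations, seed=seed)
    return positions, time.perf_counter() - start


# Stand-in graph for illustration: a tree with branching factor 3 and height 4
G = nx.balanced_tree(r=3, h=4)
for iterations in (10, 50):
    positions, elapsed = timed_layout(G, iterations=iterations)
    print(f'{iterations} iterations: {elapsed:.3f}s for {len(positions)} nodes')
```

Timing `build` and the Plotly rendering the same way would show where the rest of the time goes.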
