# Soscipy
Soscipy is a Python library that simplifies working with data, especially in the social sciences. While there are several packages out there, I have personally found it difficult to work out which library to stick to and which recipes to use. Unlike other domains, where computational methods have seen rapid growth, the social sciences remain a relatively unexplored area. This is the first of four tutorials that explore data analysis in education.

There are four parts to soscipy:
- **Data Analysis** : Aims to make rapid analysis easy without compromising on functionality or extensibility
- **Data Processing** : Makes common actions with structured data easy and accessible without needing expertise in computer science
- **Data Visualisation** : Rapid visualisations while ensuring that the output is publication quality
- **Utilities** : A set of utilities that you can plug and play to make your workflow easy

### Data Analysis

The types of structured datasets most commonly dealt with in social science include:
- Time series
- Microdata
- Geospatial

In this notebook we will see an example of each of these data types, how to do basic EDA using soscipy, and how to do quick visualisations.

### 1. Problem statement
We want to analyse the relationship between countries' expenditure on education and their income inequality. We will import data from the World Bank using the soscipy dataloader, enrich it with some common economic indicators, and then run a regression analysis and visualise it to see the relationship between these two indicators.
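As a rough sketch of the regression step described above, the snippet below fits a straight line with NumPy. The arrays hold made-up placeholder values, not World Bank figures, and `fit_line` is a hypothetical helper that is not part of soscipy.

```python
import numpy as np


def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return float(slope), float(intercept)


# Placeholder values for illustration only (NOT real World Bank data).
toy_expenditure = np.array([1.0, 2.0, 3.0, 4.0])
toy_inequality = np.array([50.0, 48.0, 46.0, 44.0])

slope, intercept = fit_line(toy_expenditure, toy_inequality)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

With real data, the sign and size of the slope give a first impression of the relationship before any further modelling.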

In [502]:
max_matrix_loc??

In [319]:
import numpy as np

def max_matrix_loc(mat):
    # Return the (row, column) position of the largest value in the matrix
    return np.unravel_index(np.argmax(mat, axis=None), mat.shape)

def intersection_count(l1,l2):
    return set(l1).intersection(set(l2))

def prim_key_candidate(dataframe):
    val = {}
    for i,c in enumerate(dataframe):
        val[i] = len(dataframe[c].unique())
    temp = sort_dict(val)
    temp = list(temp.keys())[-3:]
    temp.sort()
    return temp


def sort_dict(dictionary):
    """
    Takes a dictionary as an input and sorts it by value in ascending order
    :param dictionary: a key value pair dictionary
    :return: a sorted dictionary
    """
    return dict(sorted(dictionary.items(), key=lambda item: item[1]))


def invert_dict(dictionary):
    """
    Takes a dictionary as input and swaps its keys and values
    :param dictionary: a key value pair dictionary
    :return: an inverted dictionary
    """
    return {v: k for k, v in dictionary.items()}

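def get_primary_keys(df1,df2):
    # Candidate key columns (by position) for each dataframe
    temp1 = prim_key_candidate(df1)
    temp2 = prim_key_candidate(df2)
    candidates = np.zeros((len(temp1),len(temp2)))
    for i in range(len(temp1)):
        for j in range(len(temp2)):
            # temp1 indexes df1's columns and temp2 indexes df2's columns
            l1 = df1[df1.columns[temp1[i]]].values
            l2 = df2[df2.columns[temp2[j]]].values
            candidates[i][j] = len(intersection_count(l1,l2))
    if candidates.max() > 0:
        # Map the best matrix cell back to column positions in df1 and df2
        i, j = max_matrix_loc(candidates)
        return (temp1[i], temp2[j])
    else:
        return (-1,-1)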

def lookup(string,matches_df):
    val = matches_df[matches_df['left_side']==string]['right_side']
    if len(val) >0:
        return val.values[0]
    else:
        return string

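def combine(df1,df2):
    left_on,right_on = get_primary_keys(df1,df2)
    if left_on == -1:
        raise ValueError('No overlapping key columns found between the two dataframes')
    list1 = list(df1[df1.columns[left_on]])
    list2 = list(df2[df2.columns[right_on]])
    # Fuzzy-match the key values of both dataframes
    primary_key_joins = string_matcher(list1,list2)
    matched_list = primary_key_joins.get_matched_list()
    # Drop near-exact matches (similarity >= 0.99), such as a string matched with itself
    matched_list = matched_list[matched_list.similairity<0.99]
    # Replace df2's key values with their matched counterparts before merging
    df2[df2.columns[right_on]] = df2[df2.columns[right_on]].apply(lambda x: lookup(x,matched_list))
    temp = pd.merge(df1,df2,left_on=df1.columns[left_on],right_on=df2.columns[right_on])
    return temp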
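The key-detection idea used by `get_primary_keys` can be illustrated on its own: look at the columns with the most unique values on each side and pick the pair that shares the most values. The sketch below uses only pandas; `best_overlap_columns` and the two toy dataframes are hypothetical and not part of soscipy, and it returns column names rather than positions.

```python
import pandas as pd


def best_overlap_columns(left, right, n_candidates=3):
    """Return the (left_column, right_column) pair sharing the most distinct values.

    Only the n_candidates columns with the most unique values on each side
    are compared. Returns None when no pair shares any value.
    """
    def candidates(df):
        return df.nunique().sort_values().index[-n_candidates:]

    best_pair, best_count = None, 0
    for lcol in candidates(left):
        for rcol in candidates(right):
            shared = len(set(left[lcol].dropna()) & set(right[rcol].dropna()))
            if shared > best_count:
                best_pair, best_count = (lcol, rcol), shared
    return best_pair


# Toy frames, made up for illustration.
left = pd.DataFrame({"name": ["A", "B", "C"], "group": ["x", "x", "y"]})
right = pd.DataFrame({"judge": ["A", "B", "D"], "court": ["p", "q", "q"]})

pair = best_overlap_columns(left, right)
print(pair)
merged = pd.merge(left, right, left_on=pair[0], right_on=pair[1])
print(merged)
```

This sketch only counts exact matches; the fuzzy matching step in `combine` is what handles spelling variants between the two key columns.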

In [1]:
!pip install soscipy



**Fetching data**
- We will visit the World Bank data page and look for the datafile. Use this URL: https://data.worldbank.org/indicator/SE.XPD.TOTL.GD.ZS?view=chart

In [87]:
import pandas as pd
from soscipy.process import dfops
from soscipy.fetch.wb import world_bank_data

In [89]:
dfops.string_matcher

soscipy.process.dfops.string_matcher

In [90]:
f1 = '/Users/saurabhkarn/PycharmProjects/kornect/test_data/rangin_justicehub-file.xlsx'
f2 = '/Users/saurabhkarn/PycharmProjects/kornect/test_data/gyan_data.csv'

In [91]:
df1 = pd.read_excel(f1)
df2 = pd.read_csv(f2)

In [325]:
temp1 = prim_key_candidate(df1)
temp2 = prim_key_candidate(df2)
candidates = np.zeros((len(temp1),len(temp2)))
for i in range(len(temp1)):
    for j in range(len(temp2)):
        # temp1 indexes df1's columns and temp2 indexes df2's columns
        l1 = df1[df1.columns[temp1[i]]].values
        l2 = df2[df2.columns[temp2[j]]].values
        candidates[i][j] = len(intersection_count(l1,l2))

In [380]:
left_on,right_on = get_primary_keys(df1,df2)

In [467]:
list1 = list(df1[df1.columns[left_on]])
list2 = list(df2[df2.columns[right_on]])

In [468]:
primary_key_joins = string_matcher(list1,list2)

In [470]:
matched_list = primary_key_joins.get_matched_list()

In [501]:
combine(df1,df2)

Unnamed: 0,Judges,Date of Appointment,Whether died in office,Whether resigned from office,Date of Birth_x,Intended Date of Retirement,Cadre of Appointment,Parent High Court,Appointing Authority,Gender_x,...,"If yes, what type",Area of Practice 1 (in order mentioned in profile),Area of Practice 2,Area of Practice 3,Area of Practice 4,Area of Practice 5,Area of Practice 6,Area of Practice 7,Area of Practice 8,Area of Practice 9
0,Harilal Jekisundas Kania,1946-06-20,Yes,No,03-11-1890,1955-11-02,Hc-Bar,Bombay,british,Male,...,,,,,,,,,,
1,Mehr Chand Mahajan,1948-10-04,No,No,23-12-1889,1954-12-22,Hc-Bar,Lahore,executive,Male,...,Constitutional Adviser to His Highness the Mah...,Civil,Constitution,,,,,,,
2,Bijan Kumar Mukherjea,1948-10-14,No,Yes,15-08-1891,1956-08-14,Hc-Bar,Calcutta,executive,Male,...,Senior Government Pleader,Publication Problems Law,,,,,,,,
3,Sudhi Ranjan Das,1950-01-20,No,No,01-10-1894,1959-09-30,Hc-Bar,Calcutta,executive,Male,...,,,,,,,,,,
4,Bhuvneshwar Prasad Sinha,1954-12-03,No,No,01-02-1899,1964-01-31,Hc-Bar,Patna,executive,Male,...,Assistant Government Advocate,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
78,Surya Kant,2019-05-24,No,No,1962-02-10 00:00:00,2027-02-09,Hc-Bar,Punjab and Haryana,collegium,Male,...,Advocate General,Constitutional,Service,Civil,,,,,,
79,Krishna Murari,2019-09-23,No,No,1958-07-09 00:00:00,2023-07-08,Hc-Bar,Allahabad,Collegium,Male,...,Standing Counsel of UP.....,Civil,Constitutional,Company,Service,Revenue,,,,
80,Shripathi Ravindra Bhat,2019-09-23,No,No,1958-10-21 00:00:00,2023-10-20,Hc-Bar,Delhi,Collegium,Male,...,,Public,Employment,Education,Constitutional,,,,,
81,V. Ramasubramanian,2019-09-23,No,No,1958-06-30 00:00:00,2023-06-29,Hc-Bar,Telengana,Collegium,Male,...,,,,,,,,,,


In [499]:
len(temp)

83

In [466]:
import re
import numpy as np
import pandas as pd
import sparse_dot_topn.sparse_dot_topn as ct
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer

class string_matcher():
    def __init__(self, list1, list2, top_n=10, similarity=0.8):
        self.list1 = list1
        self.list2 = list2
        self.names = self.list1 + self.list2
        self.top_n = top_n
        self.similarity = similarity

    def ngrams(self, string, n=3):
        # Strip commas, hyphens, dots and slashes (and a ' BD' token) before building n-grams
        string = re.sub(r'[,\-./]|\sBD', r'', string)
        ngrams = zip(*[string[i:] for i in range(n)])
        return [''.join(ngram) for ngram in ngrams]

    def awesome_cossim_top(self,A, B, ntop, lower_bound=0):
        # Force A and B to CSR format.
        # If they are already CSR, there is no overhead
        A = A.tocsr()
        B = B.tocsr()
        M, _ = A.shape
        _, N = B.shape

        idx_dtype = np.int32

        nnz_max = M * ntop

        indptr = np.zeros(M + 1, dtype=idx_dtype)
        indices = np.zeros(nnz_max, dtype=idx_dtype)
        data = np.zeros(nnz_max, dtype=A.dtype)

        ct.sparse_dot_topn(
            M, N, np.asarray(A.indptr, dtype=idx_dtype),
            np.asarray(A.indices, dtype=idx_dtype),
            A.data,
            np.asarray(B.indptr, dtype=idx_dtype),
            np.asarray(B.indices, dtype=idx_dtype),
            B.data,
            ntop,
            lower_bound,
            indptr, indices, data)

        return csr_matrix((data, indices, indptr), shape=(M, N))

    def get_matches_df(self,sparse_matrix, name_vector, top=100):
        non_zeros = sparse_matrix.nonzero()
        sparserows = non_zeros[0]
        sparsecols = non_zeros[1]

        # Never read more matches than the sparse matrix actually holds
        if top:
            nr_matches = min(top, sparsecols.size)
        else:
            nr_matches = sparsecols.size

        left_side = np.empty([nr_matches], dtype=object)
        right_side = np.empty([nr_matches], dtype=object)
        similairity = np.zeros(nr_matches)

        for index in range(0, nr_matches):
            left_side[index] = name_vector[sparserows[index]]
            right_side[index] = name_vector[sparsecols[index]]
            similairity[index] = sparse_matrix.data[index]

        return pd.DataFrame({'left_side': left_side,
                             'right_side': right_side,
                             'similairity': similairity})

    def get_matched_list(self):
        vectorizer = TfidfVectorizer(min_df=1, analyzer=self.ngrams)
        tf_idf_matrix = vectorizer.fit_transform(self.names)
        matches = self.awesome_cossim_top(tf_idf_matrix, tf_idf_matrix.transpose(), self.top_n, self.similarity)
        matches_df = self.get_matches_df(matches, self.names, top=len(self.names))
        return matches_df
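The matcher above compares strings through character trigrams. The sketch below shows the same idea in plain Python, using raw n-gram counts instead of TF-IDF weights, so its scores will differ from those of `string_matcher`. `char_ngrams` and `cosine_similarity` are hypothetical helper names, and "Suryakant" is a made-up spelling variant used only for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Split a string into overlapping character n-grams."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def cosine_similarity(a, b, n=3):
    """Cosine similarity between the n-gram count vectors of two strings."""
    va, vb = Counter(char_ngrams(a, n)), Counter(char_ngrams(b, n))
    dot = sum(count * vb[gram] for gram, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


same = cosine_similarity("Surya Kant", "Surya Kant")
close = cosine_similarity("Surya Kant", "Suryakant")
far = cosine_similarity("Surya Kant", "Arun Mishra")
print(same, close, far)
```

Identical strings score highest, a close spelling variant scores lower, and names with no trigram in common score zero, which is why a similarity threshold can separate likely matches from unrelated names.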

In [372]:
from dateutil.parser import ParserError

def convert_pd_datetime(series):
    # Return the parsed datetimes, or -1 when the values cannot be parsed
    try:
        return pd.to_datetime(series)
    except (ParserError, ValueError, TypeError):
        return -1

In [336]:
df2[df2.columns[temp1[0]]]

0      Sharad Arvind Bobde
1              N.V. Ramana
2             R.F. Nariman
3         Uday Umesh Lalit
4          A.M. Khanwilkar
              ...         
242      Arjan Kumar Sikri
243    Abhay Manohar Sapre
244           Deepak Gupta
245           R. Bhanumati
246            Arun Mishra
Name: Name of Judge, Length: 247, dtype: object

In [323]:
get_primary_keys(df1,df2)

(0, 0)

In [324]:
pd.merge()

In [4]:
# Fetching data from the World Bank
data_url = "https://data.worldbank.org/indicator/SE.XPD.TOTL.GD.ZS?view=chart"

In [5]:
data = world_bank_data(data_url)
data

| | Country | Year | Government expenditure on education, total (% of GDP) |
|---|---|---|---|
| 0 | Arab World | 2019 | |
| 1 | Caribbean small states | 2019 | 5.16919 |
| 2 | Central Europe and the Baltics | 2019 | |
| 3 | Early-demographic dividend | 2019 | |
| 4 | East Asia & Pacific | 2019 | |
| ... | ... | ... | ... |
| 259 | Virgin Islands (U.S.) | 2019 | |
| 260 | West Bank and Gaza | 2019 | |
| 261 | Yemen, Rep. | 2019 | |
| 262 | Zambia | 2019 | |


In [9]:
import world_bank_data as wb

In [84]:
def get_indicator(url):
    # The indicator code is the last path segment, before any query string
    indicator = url.split('?')[0].split('/')[-1]
    return indicator

def world_bank_data(url, date, mrv=2):
    """
    Takes a URL as input and extracts the indicator string, which is then used to fetch the series from World Bank data
    :param url: URL of the data page
    :param date: year or range of years to fetch, e.g. '2018:2020'
    :param mrv: number of most recent values to fetch
    :return: Dataframe with the indicator as the last column
    """
    indicator = get_indicator(url)
    data = wb.get_series(indicator, date=date,mrv=mrv).to_frame().reset_index()
    series = data['Series'].unique()[0]
    data = data.drop(['Series'], axis=1)
    data.Year = data.Year.astype(int)
    data = dfops.rename_pd(data, [data.columns[-1]], [series])
    return data
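To make the indicator extraction concrete, here is a small standalone sketch using only the standard library. `indicator_from_url` is a hypothetical name; it does the same job as `get_indicator` above but relies on `urllib.parse` to separate the path from the query string.

```python
from urllib.parse import urlparse


def indicator_from_url(url):
    """Return the last path segment of a World Bank indicator URL."""
    path = urlparse(url).path  # the '?view=chart' query string is not part of the path
    return path.rstrip("/").split("/")[-1]


data_url = "https://data.worldbank.org/indicator/SE.XPD.TOTL.GD.ZS?view=chart"
print(indicator_from_url(data_url))  # SE.XPD.TOTL.GD.ZS
```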

https://wbdata.readthedocs.io/en/stable/

In [85]:
data = world_bank_data(data_url,date='2018:2020')
data

| | Country | Year | Government expenditure on education, total (% of GDP) |
|---|---|---|---|
| 0 | Arab World | 2018 | |
| 1 | Arab World | 2019 | |
| 2 | Caribbean small states | 2018 | 5.41478 |
| 3 | Caribbean small states | 2019 | 5.16919 |
| 4 | Central Europe and the Baltics | 2018 | |
| ... | ... | ... | ... |
| 523 | Yemen, Rep. | 2019 | |
| 524 | Zambia | 2018 | 4.61809 |
| 525 | Zambia | 2019 | |
| 526 | Zimbabwe | 2018 | 5.87135 |


In [65]:
for Y in data.Year.unique():
    data[str(Y)] = data[data.Year == Y][data.columns[-1]]

In [58]:
data[str(Y)] = data[data.Year == Y][data.columns[-1]]

In [59]:
data

| | Country | Year | Government expenditure on education, total (% of GDP) | 2018 |
|---|---|---|---|---|
| 0 | Arab World | 2018 | | |
| 1 | Arab World | 2019 | | |
| 2 | Arab World | 2020 | | |
| 3 | Caribbean small states | 2018 | 5.41478 | 5.41478 |
| 4 | Caribbean small states | 2019 | 5.16919 | |
| ... | ... | ... | ... | ... |
| 787 | Zambia | 2019 | | |
| 788 | Zambia | 2020 | | |
| 789 | Zimbabwe | 2018 | 5.87135 | 5.87135 |
| 790 | Zimbabwe | 2019 | | |
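The loop above adds one column per year but keeps the table in long form, so most cells in the new columns stay empty. One possible alternative, sketched here under the assumption that a wide table (one row per country, one column per year) is the goal, is `DataFrame.pivot`. The rows are copied from the output above, and `Expenditure` is a shortened column name chosen for this example.

```python
import pandas as pd

# A few rows copied from the output shown above; NaN marks missing values.
long_df = pd.DataFrame({
    "Country": ["Caribbean small states", "Caribbean small states", "Zambia", "Zambia"],
    "Year": [2018, 2019, 2018, 2019],
    "Expenditure": [5.41478, 5.16919, 4.61809, float("nan")],
})

# One row per country, one column per year.
wide_df = long_df.pivot(index="Country", columns="Year", values="Expenditure")
wide_df.columns = [str(year) for year in wide_df.columns]
print(wide_df)
```

The wide form makes year-on-year comparisons per country straightforward, while the long form is usually more convenient for merging with other indicators.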


### Installation 

In [None]:
!pip install --upgrade soscipy