# **Classify**

This endpoint classifies text into one of several classes, using a few labelled examples that you pass in. For the default small, medium, and large models, we create a classifier using our Representational model. For the default xlarge model, we construct a few-shot classifier prompt and pass it to our Generative model, which predicts the class.
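To make the few-shot idea concrete, here is a minimal sketch of how labelled examples and a query could be laid out as a single prompt. The function name, the `Text:`/`Label:` layout, the `--` separator, and the sample rows are all hypothetical; this is not the prompt format Cohere uses internally.

```python
def build_few_shot_prompt(examples, query):
    """Format (text, label) pairs and a query as a few-shot classification prompt.

    The layout is hypothetical; it is not Cohere's actual prompt format.
    """
    blocks = [f"Text: {text}\nLabel: {label}" for text, label in examples]
    # Leave the final label blank so a generative model would complete it.
    blocks.append(f"Text: {query}\nLabel:")
    return "\n--\n".join(blocks)


# Made-up examples, for illustration only.
examples = [
    ("The home is a short walk from the park.", "0"),
    ("The home is near a house of prayer.", "1"),
]
prompt = build_few_shot_prompt(examples, "Close to the local church.")
print(prompt)
```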

In [1]:
#!pip install cohere
import cohere
from cohere.classify import Example

In [2]:
# Each CSV row holds a text and its label; each row is wrapped in an Example(text, label)
import pandas as pd

# df_train = pd.read_csv('rm_synthetic_data/train/RM_bias_detection_train.csv', header=None)
df_train = pd.read_csv('rm_synthetic_data/train/train_church_binary.csv', header=None)
# df_train = pd.read_csv('rm_synthetic_data/train/empty_train_5_5.csv', header=None)

df_train.columns = ['reviews','lables']
train_samples = []

for index, sample in df_train.iterrows():
    train_samples.append(Example(sample['reviews'], str(sample['lables'])))
    
# df_test = pd.read_csv('rm_synthetic_data/test/RM_bias_detection_test.csv', header=None)
df_test = pd.read_csv('rm_synthetic_data/test/test_church_binary.csv', header=None)
df_test.columns = ['reviews','lables']

# #################################################################
train_reviews_list = list(df_train['reviews'])
test_reviews_list = list(df_test['reviews'])

train_lables_list = list(df_train['lables'])
test_lables_list = list(df_test['lables'])

# print(df_train.iloc[5])
# print(df_train['reviews'])
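The cell above assumes headerless CSV files with two columns, the text first and the label second. The sketch below shows that layout with the standard library only; the two rows are made up and do not come from the actual dataset.

```python
import csv
import io

# Made-up rows standing in for a headerless two-column file (text, label).
# A text that contains a comma must be quoted.
raw = 'Quiet street near a park,0\n"Steps from a church, with parking",1\n'

rows = list(csv.reader(io.StringIO(raw)))
texts = [text for text, _ in rows]
labels = [int(label) for _, label in rows]
print(texts, labels)
```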

In [None]:
# Replace the placeholder with your own API key; avoid committing real keys.
co = cohere.Client('<YOUR_API_KEY>')
response = co.classify(model='small', inputs=test_reviews_list, examples=train_samples)

# response = co.classify(model='small',inputs = ['this house is located with close proximity to house of prayer '], examples = train_samples )
# print(response.classifications)
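Rather than pasting an API key into the notebook, one option is to read it from an environment variable. The sketch below assumes a variable named `COHERE_API_KEY`; that name and the `load_api_key` helper are illustrative choices, not something the Cohere SDK requires.

```python
import os


def load_api_key(var_name="COHERE_API_KEY"):
    """Return the API key stored in an environment variable, or raise if unset."""
    # var_name is a hypothetical choice; use whatever name suits your setup.
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return key
```

The client could then be created with `cohere.Client(load_api_key())`.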

In [4]:
# print('The confidence levels of the labels are: {}'.format(response.classifications))
# print(response.classifications)
# Collect the predicted label for each test input.
classify_predictions = [c.prediction for c in response.classifications]

# **Embed**

This endpoint returns text embeddings. An embedding is a list of floating point numbers that captures semantic information about the text it represents. Embeddings can be used to build text classifiers and to power semantic search. To learn more about embeddings, see the embedding page.
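As a rough illustration of how embeddings support semantic search, the sketch below ranks two documents by cosine similarity to a query. The 3-dimensional vectors are made up; real embeddings returned by the endpoint are much longer.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors, made up for illustration only.
documents = {
    "near a church": [0.9, 0.1, 0.0],
    "large backyard": [0.1, 0.8, 0.3],
}
query = [0.8, 0.2, 0.1]

# The document whose vector points in the most similar direction wins.
best_match = max(documents, key=lambda name: cosine_similarity(query, documents[name]))
print(best_match)
```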

## **Get the embeddings of the reviews:**

In [5]:
embeddings_train_reviews = co.embed(texts=train_reviews_list)
embeddings_test_reviews = co.embed(texts=test_reviews_list)

In [6]:
# print(type(embeddings_train_reviews))
# print(embeddings_train_reviews.embeddings)

## **Train a classifier using the training set**

Now that we have the embeddings, we can train our classifier. We'll use an SVM from sklearn:

In [7]:
# Initialize the support vector machine. Our training set has roughly equal
# numbers of examples in each class; class_weight='balanced' weights each
# class inversely to its frequency, which guards against any remaining imbalance.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

svm_classifier = make_pipeline(StandardScaler(), SVC(class_weight='balanced'))

# fit the support vector machine
svm_classifier.fit(embeddings_train_reviews.embeddings, train_lables_list)

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('svc', SVC(class_weight='balanced'))])
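For reference, scikit-learn documents the `'balanced'` mode as weighting each class by `n_samples / (n_classes * class_count)`. The sketch below reproduces that formula in plain Python on a made-up label list; `balanced_class_weights` is a hypothetical helper, not a scikit-learn function.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Made-up label list with a 3:1 imbalance.
weights = balanced_class_weights([0, 0, 0, 1])
print(weights)
```

With equally sized classes every weight comes out as 1.0, so the option has no effect on a perfectly balanced training set.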

## **Evaluate the performance of the classifier on the test set**

In [8]:
score = svm_classifier.score(embeddings_test_reviews.embeddings, test_lables_list)
embed_predictions = svm_classifier.predict(embeddings_test_reviews.embeddings)
# print(f"Validation accuracy on Small is {100*score}%!")

In [9]:
# print('Embed Endpoint Predictions: ', embed_predictions)
# print('Classify Endpoint Predictions: ', classify_predictions)

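######################### calculate accuracy for Classify Endpoint #####################
import ast

a = test_lables_list
# Predictions come back as strings; parse them safely instead of using eval().
b = [ast.literal_eval(i) for i in classify_predictions]

# Fraction of test samples whose predicted label matches the ground truth.
score_classify = sum(1 for truth, pred in zip(a, b) if truth == pred) / len(a)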

## **XGBoost Classifier Head**

In [10]:
#!pip install xgboost
import xgboost as xgb
from sklearn.metrics import auc, accuracy_score, confusion_matrix, mean_squared_error

xgb_model = xgb.XGBClassifier(objective="binary:logistic", random_state=42)
xgb_model.fit(embeddings_train_reviews.embeddings, train_lables_list)

xgb_pred_train = xgb_model.predict(embeddings_train_reviews.embeddings)
xgb_pred_test = xgb_model.predict(embeddings_test_reviews.embeddings)

# print("confusion matrix on training set\n",confusion_matrix(train_lables_list, xgb_pred_train))
# print("confusion matrix on testset\n",confusion_matrix(test_lables_list, xgb_pred_test))
print(xgb_pred_test)

######################### calculate accuracy for Embed+XGB Endpoint #####################
c = test_lables_list
d = xgb_pred_test

# Fraction of test samples whose XGBoost prediction matches the ground truth.
score_XGB = sum(1 for truth, pred in zip(c, d) if truth == pred) / len(c)

[1 1 0 1 0 1 1 0 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0]


In [11]:
######################### calculate F1 score #########################
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

f1_classify = round(f1_score(test_lables_list, b),2)
precision_classify = round(precision_score(test_lables_list, b),2)
recall_classify = round(recall_score(test_lables_list, b),2)

f1_embed = round(f1_score(test_lables_list, embed_predictions),2)
precision_embed = round(precision_score(test_lables_list, embed_predictions),2)
recall_embed = round(recall_score(test_lables_list, embed_predictions),2)

f1_xgb = round(f1_score(test_lables_list, xgb_pred_test),2)
precision_xgb = round(precision_score(test_lables_list, xgb_pred_test),2)
recall_xgb = round(recall_score(test_lables_list, xgb_pred_test),2)
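To show what these three metrics measure, here is a minimal from-scratch sketch for the binary case. The `binary_scores` helper is hypothetical and the label lists are made up; they are not results from this notebook.

```python
def binary_scores(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Fall back to 0.0 when a denominator is zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up labels, for illustration only.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]
precision, recall, f1 = binary_scores(y_true, y_pred)
print(round(precision, 2), round(recall, 2), round(f1, 2))
```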

In [12]:
#####################################################################################
from datetime import date, datetime

Predictions_pd = pd.DataFrame(columns = ['Test_Samples',"GroundTruth",'Embed+SVM','Classify',"Embed+XGB"])
Predictions_pd['Test_Samples'] = test_reviews_list
Predictions_pd['Embed+SVM'] = embed_predictions
Predictions_pd['Classify'] = classify_predictions
Predictions_pd['GroundTruth'] = test_lables_list
Predictions_pd['Embed+XGB'] = xgb_pred_test

Predictions_pd.loc[len(Predictions_pd.index)] = ["accuracy", "--", round(score,2), round(score_classify,2), round(score_XGB,2)]
Predictions_pd.loc[len(Predictions_pd.index)] = ["f1 score", "--", f1_embed, f1_classify, f1_xgb]
Predictions_pd.loc[len(Predictions_pd.index)] = ["precision", "--", precision_embed, precision_classify, precision_xgb]
Predictions_pd.loc[len(Predictions_pd.index)] = ["recall", "--", recall_embed, recall_classify, recall_xgb]

today = date.today()
Predictions_pd.to_csv('classify_embed_test_prediction_church_{}.csv'.format(today), sep=',', index=False)

This was a small-scale example, meant as a proof of concept and designed to illustrate how you can quickly build a custom classifier using a small amount of labelled data and Cohere's embeddings. Increasing the number of training examples should improve performance on this task.

In [None]:
import cohere

# Replace the placeholder with your own API key; avoid committing real keys.
co = cohere.Client('<YOUR_API_KEY>')
response = co.generate(
  model='large',
  prompt='The following contains biased statement about a real estate listing. Biased statements contain negative or positive opinions based on objective factors.\n\n statement: Hispanic community in the neighbourhood may reduce the price of the property. \n--\nstatement: There is a big community of Latino families in the area. \n--\nstatement: American Indians make up an overwhelming majority in the neighborhood. \n--\nstatement: There is an influx of Asian community buying properties in the area. \n--\nstatement: There is a lack of African American families in the neighborhood. \n--\nstatement: There is a growing Native Hawaiian population in this area. \n--\nstatement: Pacific Islander community has been growing in the area. \n--\nstatement:',
  max_tokens=100,
  temperature=0.8,
  k=0,
  p=1,
  frequency_penalty=0,
  presence_penalty=0,
  stop_sequences=["--"],
  return_likelihoods='NONE',
  num_generations=5)
  
# Print every returned generation.
for generation in response.generations:
    print('Prediction: {}'.format(generation.text))
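Sampled generations often need light cleanup before use. The sketch below assumes the returned text may carry surrounding whitespace and possibly the trailing `--` stop sequence; whether the stop sequence is included can depend on the API version. `clean_generations` is a hypothetical helper and the sample strings are made up.

```python
def clean_generations(texts, stop_sequence="--"):
    """Strip whitespace and a trailing stop sequence, dropping blanks and duplicates."""
    cleaned = []
    for text in texts:
        text = text.strip()
        if text.endswith(stop_sequence):
            text = text[: -len(stop_sequence)].strip()
        # Keep only non-empty statements not seen before, in original order.
        if text and text not in cleaned:
            cleaned.append(text)
    return cleaned


# Made-up strings standing in for response.generations[i].text.
samples = [" A statement. \n--", "A statement.", "  ", "Another one.\n--"]
print(clean_generations(samples))
```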

# **Cohere Platform CLI Tool**

The Cohere Platform CLI Tool is an alternative to our web interface. It allows you to log in to your Cohere account, manage API keys, and run finetunes.

This CLI tool is POSIX compliant (you can expect arguments and flags to work the same as they do with other popular CLI tools). Don't forget to use `co --help` or `co [COMMAND] --help` if you don't want to check back to this page!

## **Install**

1. Download the package for your OS. Use the following curl command to download the correct package, or use a download link below to get a tar.

https://github.com/cohere-ai/co/releases/latest/download/co_linux_x86_64.tar.gz

2. Move the binary into your $PATH (if you'd like to).
3. Authenticate.

In [None]:
# curl --proto '=https' --tlsv1.2 -sSf https://raw.githubusercontent.com/cohere-ai/co/main/install.sh | sh
# mkdir -p /usr/local/bin
# mv ./co /usr/local/bin/
# co auth login --email=EMAIL